Abstract: We study conditional linear information inequalities, i.e., linear inequalities for Shannon entropy that hold for distributions whose entropies meet some linear constraints. We prove that some conditional information inequalities cannot be extended to any unconditional linear inequalities. Some of these conditional inequalities hold for almost entropic points, while others do not. We also discuss some counterparts of conditional information inequalities for Kolmogorov complexity.

Index Terms: Information inequalities, non-Shannon-type inequalities, conditional inequalities, almost entropic points, Kolmogorov complexity

[Submitted to the IEEE Transactions on Information Theory.]

I. Introduction 

In this paper we consider discrete random variables with a finite range. Let $(x_1, \ldots, x_n)$ be a joint distribution of $n$ random variables. Then several standard information quantities are defined for this distribution. The basic quantities are the Shannon entropy of each variable $H(x_i)$, the entropies of pairs $H(x_i, x_j)$, the entropies of triples, quadruples, and so on. The other standard information quantities are the conditional entropies, e.g., $H(x_i|x_j)$, the mutual informations, e.g., $I(x_i:x_j)$, and the conditional mutual informations, e.g., $I(x_i:x_j|x_k)$. All these quantities can be represented as linear combinations of the plain Shannon entropies, e.g.,

$$\begin{aligned}
H(x_i|x_j) &= H(x_i, x_j) - H(x_j),\\
I(x_i:x_j) &= H(x_i) + H(x_j) - H(x_i, x_j),\\
I(x_i:x_j|x_k) &= H(x_i, x_k) + H(x_j, x_k) - H(x_i, x_j, x_k) - H(x_k).
\end{aligned}$$

Thus, the values of entropies determine all other standard information quantities of the distribution.
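
As a small illustration of these identities, the following Python sketch computes conditional entropy and (conditional) mutual information purely from joint entropies. The helper names `H`, `cond_H` and `I`, the dictionary representation of a distribution, and the example (two independent fair bits and their XOR) are illustrative assumptions, not part of the paper.

```python
from math import log2
from collections import defaultdict

def H(p, idx):
    """Shannon entropy (in bits) of the variables at positions idx."""
    marginal = defaultdict(float)
    for outcome, prob in p.items():
        marginal[tuple(outcome[i] for i in idx)] += prob
    return -sum(v * log2(v) for v in marginal.values() if v > 0)

def cond_H(p, x, y):
    """H(x|y) = H(x,y) - H(y)."""
    return H(p, x + y) - H(p, y)

def I(p, x, y, z=()):
    """I(x:y|z) = H(x,z) + H(y,z) - H(x,y,z) - H(z)."""
    return H(p, x + z) + H(p, y + z) - H(p, x + y + z) - H(p, z)

# Hypothetical example: x1, x2 are independent fair bits and x3 = x1 XOR x2.
p = {(u, v, u ^ v): 0.25 for u in (0, 1) for v in (0, 1)}

print(cond_H(p, (0,), (1,)))    # H(x1|x2)
print(I(p, (0,), (1,)))         # I(x1:x2)
print(I(p, (0,), (1,), (2,)))   # I(x1:x2|x3)
```

Every quantity is obtained from calls to `H` alone, which mirrors the statement that the entropies determine all other standard quantities.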

There are no linear dependencies between different entropies $H(x_{i_1}, \ldots, x_{i_k})$, but their values cannot be arbitrary: they must satisfy some constraints. The most important constraints for entropies (of jointly distributed variables) can be written as linear inequalities. For instance, it is well known that every distribution satisfies

$$H(x_i, x_j) \ge H(x_i)$$

(which means that $H(x_j|x_i) \ge 0$),

$$H(x_i) + H(x_j) \ge H(x_i, x_j)$$

(which can be expressed in the standard notation as $I(x_i:x_j) \ge 0$), and

$$H(x_i, x_k) + H(x_j, x_k) \ge H(x_i, x_j, x_k) + H(x_k)$$

(i.e., $I(x_i:x_j|x_k) \ge 0$ in the standard notation). (Actually the first two inequalities above can be obtained as special cases of the third one.) These basic inequalities go back to Shannon's original papers on information theory. Obviously they remain true if we substitute any tuples of variables for the individual variables $x_i$, $x_j$, $x_k$. Also, any linear combination of these basic inequalities with positive coefficients is a valid inequality. The inequalities for entropies that can be obtained in this way (combinations of instances of the basic inequalities summed up with some positive coefficients) are referred to as Shannon-type inequalities, see, e.g., [28].

Linear inequalities for entropies are "general laws" of information, which are widely used in information theory to describe limits in communication, compression, secrecy, etc. Information inequalities also have applications beyond information theory, e.g., in combinatorics and in group theory, see the survey in [5]. So, it is natural to ask whether there exist linear inequalities for entropy that hold for all distributions but are not Shannon-type. In 1986 N. Pippenger asked a general question [24]: what is the class of all linear inequalities for Shannon entropy?

The first example of a non-Shannon-type inequality was discovered by Z. Zhang and R.W. Yeung in 1998. Many (in fact, infinitely many) other linear inequalities for entropy have been found since then. It was discovered that the set of all valid linear inequalities for entropy cannot be reduced to any finite number of inequalities: F. Matúš proved that for $n \ge 4$ the cone of all linear inequalities for $n$-tuples of random variables is not polyhedral [20]. There is still no full answer to Pippenger's question: a simple characterization of the class of all linear inequalities for Shannon entropy is yet to be found.

In this paper we investigate a slightly different class of "laws of information theory". We consider conditional information inequalities, i.e., linear inequalities for entropies that hold only for distributions that meet some linear constraints on entropies. To explain this notion we start with very basic examples.

Example 1: If $I(a:b) = 0$, then $H(a) + H(b) \le H(a,b)$. This follows immediately from the definition of the mutual information.

Example 2: If $I(a:b) = 0$, then

$$H(a) + H(b) + H(c) \le H(a,c) + H(b,c).$$

This follows from an unconditional Shannon-type inequality: for all $a, b, c$

$$H(a) + H(b) + H(c) \le H(a,c) + H(b,c) + I(a:b).$$

(This inequality is equivalent to the sum of two basic inequalities: $H(c|a,b) \ge 0$ and $I(a:b|c) \ge 0$.)
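
The decomposition used in Example 2 can be checked numerically. The sketch below samples random distributions of three binary variables (the sampling scheme, the seed and the helper names are assumptions made for illustration) and verifies that the slack of the unconditional inequality equals $H(c|a,b) + I(a:b|c)$, so it is never negative.

```python
import random
from math import log2
from collections import defaultdict
from itertools import product

def H(p, idx):
    """Shannon entropy (in bits) of the variables at positions idx."""
    marginal = defaultdict(float)
    for outcome, prob in p.items():
        marginal[tuple(outcome[i] for i in idx)] += prob
    return -sum(v * log2(v) for v in marginal.values() if v > 0)

def random_distribution(rng, n=3):
    """A random joint distribution of n binary variables (illustrative sampler)."""
    outcomes = list(product((0, 1), repeat=n))
    weights = [rng.random() for _ in outcomes]
    total = sum(weights)
    return {o: w / total for o, w in zip(outcomes, weights)}

rng = random.Random(0)
A, B, C = (0,), (1,), (2,)
worst_slack = float("inf")
max_identity_error = 0.0
for _ in range(1000):
    p = random_distribution(rng)
    i_ab = H(p, A) + H(p, B) - H(p, A + B)
    lhs = H(p, A) + H(p, B) + H(p, C)
    rhs = H(p, A + C) + H(p, B + C) + i_ab
    h_c_given_ab = H(p, A + B + C) - H(p, A + B)
    i_ab_given_c = H(p, A + C) + H(p, B + C) - H(p, A + B + C) - H(p, C)
    # rhs - lhs should coincide with the sum of the two basic quantities
    max_identity_error = max(max_identity_error,
                             abs((rhs - lhs) - (h_c_given_ab + i_ab_given_c)))
    worst_slack = min(worst_slack, rhs - lhs)

print(worst_slack, max_identity_error)
```

A random search of this kind can only illustrate the inequality; the proof is the identity itself.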

Example 3: If $I(e:c|d) = I(e:d|c) = I(c:d|e) = 0$, then

$$I(c:d) \le I(c:d|a) + I(c:d|b) + I(a:b).$$

This is a corollary of the non-Shannon-type inequality from [15] (a slight generalization of the inequality proven by Zhang and Yeung in [30]),

$$I(c:d) \le I(c:d|a) + I(c:d|b) + I(a:b) + [I(e:c|d) + I(e:d|c) + I(c:d|e)],$$

which holds for all $(a, b, c, d, e)$ (without any constraints on the distribution).

In these three examples we have statements of the form 

If some equalities hold for entropies of $x_1, \ldots, x_n$, then some inequality holds for entropies of these variables.

This is what we call a conditional information inequality (also 
referred to as a constrained information inequality). 

In Examples 1-3 the conditional inequalities are derived directly from the corresponding unconditional ones. But not all proofs of conditional inequalities are so straightforward. In 1997 Z. Zhang and R.W. Yeung came up with a conditional inequality

$$\text{if } I(a:b) = I(a:b|c) = 0, \text{ then } I(c:d) \le I(c:d|a) + I(c:d|b) + I(a:b), \tag{1}$$

see [29]. If we wanted to prove (1) similarly to Examples 1-3 above, then we should first prove an unconditional inequality

$$I(c:d) \le I(c:d|a) + I(c:d|b) + I(a:b) + \lambda_1 I(a:b) + \lambda_2 I(a:b|c) \tag{2}$$

with some "Lagrange multipliers" $\lambda_1, \lambda_2 \ge 0$. However, the proof in [29] does not follow this scheme. Can we find an alternative proof of (1) that would be based on an instance of (2), for some $\lambda_1$ and $\lambda_2$? In this paper we show that this is impossible. We prove that, whatever the values $\lambda_1, \lambda_2$, unconditional inequality (2) does not hold for Shannon entropy.

Note that in [29] it was already proven that inequality (1) cannot be deduced from Shannon-type inequalities (the only linear inequalities for entropy known in 1997). We prove a stronger statement: (1) cannot be deduced directly from any unconditional linear inequalities for entropy, Shannon-type or non-Shannon-type, known or yet unknown.

Since conditional inequality (1) cannot be extended to any unconditional inequality (2), we call Zhang-Yeung's inequality essentially conditional. (The formal definition of an essentially conditional linear information inequality is given in Section III.)

We show that several other inequalities are also essentially conditional in the same sense, i.e., cannot be deduced directly from any unconditional linear inequality. Besides Zhang-Yeung's inequality discussed above, we prove this property for a conditional inequality from [18] and for three conditional inequalities implicitly proven in [20]. We also construct one new conditional inequality and show that it is essentially conditional.

It turns out that essentially conditional information inequalities can be divided into two classes: some of them hold only for entropic points (i.e., points representing entropies of some distributions), while others hold also for almost entropic points (limits of entropic points). Conditional inequalities of the second type (those which hold for almost entropic points) have an intuitive geometric meaning. We show that the very fact that such inequalities exist implies the seminal theorem of Matúš: the cone of linear information inequalities is not polyhedral. The geometric and physical meaning of the inequalities of the first type (that hold for all entropic but not for all almost entropic points) remains obscure.

Linear information inequalities can also be studied in the framework of Kolmogorov complexity. For unconditional linear information inequalities the parallel between Shannon's and Kolmogorov's settings is very prominent. It is known that the class of unconditional linear inequalities for Shannon entropy coincides with the class of unconditional linear inequalities for Kolmogorov complexity (that hold up to an additive logarithmic term) [9]. Conditional information inequalities can also be considered in the setting of Kolmogorov complexity, though the parallelism between Shannon's and Kolmogorov's conditional inequalities is more complicated. We show that three essentially conditional information inequalities are valid, in some natural sense, for Kolmogorov complexity (these are the same three conditional inequalities proven to be valid for almost entropic points), while two other conditional inequalities (which hold for entropic but not for almost entropic points) do not hold for Kolmogorov complexity. For another essentially conditional inequality the question remains open: we do not know whether it holds for almost entropic points and/or for Kolmogorov complexity.

Some results of this work were presented in conference papers [10], [12] and in a preprint [11].

Remark 1: Essentially conditional information inequalities may seem an unusual object of study. Nevertheless, conditional inequalities are widely used in information theory as a helpful tool. For instance, conditional information inequalities are involved in proofs of conditional independence inference rules, see [16], [17], [27]. Also, lower bounds for the information ratio in secret sharing schemes are based on conditional information inequalities, [1], [2], [23]. However, in most applications (e.g., in all applications in secret sharing known to the authors) the conditional inequalities used are just straightforward corollaries of unconditional inequalities, as in Examples 1-3 above. So, in most papers the attention is not focused on the fact that a conditional inequality for entropies is employed in this or that proof.

A. Preliminaries 

Here we define some notation (mostly standard) used in the 
paper. 
a) Shannon entropy: All random variables in this paper are discrete random variables with a finite range. The Shannon entropy of a random variable $\xi$ is defined as

$$H(\xi) = \sum_{a} p_a \log \frac{1}{p_a},$$

where $p_a = \mathrm{Prob}[\xi = a]$ for each $a$ in the range of $\xi$.

b) Entropy profile: Let $\mathcal{X} = \{x_1, \ldots, x_n\}$ be $n$ jointly distributed random variables. Then each non-empty tuple $X \subseteq \mathcal{X}$ is itself a random variable, with which we can associate its Shannon entropy $H(X)$. The entropy profile of $\mathcal{X}$ is defined as the vector

$$\vec{H}(\mathcal{X}) = (H(X))_{\emptyset \ne X \subseteq \mathcal{X}}$$

of entropies for each non-empty subset of random variables, in the lexicographic order.
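
To make the definition concrete, here is a minimal Python sketch that computes an entropy profile with $2^n - 1$ coordinates. The function name `entropy_profile`, the indexing of variables by positions $0, \ldots, n-1$, and the example distribution are assumptions chosen for illustration.

```python
from math import log2
from collections import defaultdict
from itertools import combinations

def H(p, idx):
    """Shannon entropy (in bits) of the variables at positions idx."""
    marginal = defaultdict(float)
    for outcome, prob in p.items():
        marginal[tuple(outcome[i] for i in idx)] += prob
    return -sum(v * log2(v) for v in marginal.values() if v > 0)

def entropy_profile(p, n):
    """Map every non-empty subset of {0, ..., n-1} to its joint entropy.

    Subsets are stored as sorted index tuples, listed in lexicographic order.
    """
    subsets = sorted(s for k in range(1, n + 1) for s in combinations(range(n), k))
    return {s: H(p, s) for s in subsets}

# Hypothetical example: two independent fair bits and their XOR.
p = {(u, v, u ^ v): 0.25 for u in (0, 1) for v in (0, 1)}
profile = entropy_profile(p, 3)
for subset, value in profile.items():
    print(subset, value)
```

For $n = 3$ the profile has seven coordinates, one per non-empty subset of the three variables.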

c) Entropic points: A point in $\mathbb{R}^{2^n - 1}$ is called entropic if it is the entropy profile of some distribution $\mathcal{X}$. A point is called almost entropic if it belongs to the closure of the set of entropic points.

d) Linear information inequality: A linear information inequality for $n$-tuples of random variables is a linear form with $2^n - 1$ real coefficients $(c_X)_{\emptyset \ne X \subseteq \mathcal{X}}$ such that for all jointly distributed random variables $\mathcal{X} = \{x_1, \ldots, x_n\}$

$$\sum_{\emptyset \ne X \subseteq \mathcal{X}} c_X H(X) \ge 0.$$

e) Box notation: To make formulae more compact, we shall use the notation from [20]:

$$\Box_{ab,cd} := I(c:d|a) + I(c:d|b) + I(a:b) - I(c:d).$$
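
The box quantity is easy to evaluate directly. In the sketch below the function `box` and both example distributions are illustrative assumptions. The second example shows that the box quantity can be negative for some distributions, which is why the statements studied in this paper need constraints.

```python
from math import log2
from collections import defaultdict

def H(p, idx):
    """Shannon entropy (in bits) of the variables at positions idx."""
    marginal = defaultdict(float)
    for outcome, prob in p.items():
        marginal[tuple(outcome[i] for i in idx)] += prob
    return -sum(v * log2(v) for v in marginal.values() if v > 0)

def I(p, x, y, z=()):
    """I(x:y|z) = H(x,z) + H(y,z) - H(x,y,z) - H(z)."""
    return H(p, x + z) + H(p, y + z) - H(p, x + y + z) - H(p, z)

def box(p, a=(0,), b=(1,), c=(2,), d=(3,)):
    """Box_{ab,cd} = I(c:d|a) + I(c:d|b) + I(a:b) - I(c:d); outcomes are (a, b, c, d)."""
    return I(p, c, d, a) + I(p, c, d, b) + I(p, a, b) - I(p, c, d)

# Hypothetical example 1: all four variables are the same fair bit.
same_bit = {(u, u, u, u): 0.5 for u in (0, 1)}
# Hypothetical example 2: a and b are independent, c and d are each
# determined by a or by b in every case, yet c and d are correlated.
skewed = {(0, 0, 0, 0): 0.4, (0, 1, 0, 1): 0.4, (1, 0, 1, 0): 0.1, (1, 1, 0, 0): 0.1}

print(box(same_bit))
print(box(skewed))
```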

B. Results

Here we briefly formulate our main results and explain the 
structure of the paper. 

1. Nontrivial proofs of conditional inequalities. In Examples 1-3 discussed above, we presented a trivial way to deduce some conditional information inequalities from the corresponding unconditional ones. But some known proofs of conditional inequalities do not follow this naive scheme. In Section II we systematize the known nontrivial proofs of conditional linear information inequalities.

Basically, two different techniques are known. The first technique was proposed by Z. Zhang and R.W. Yeung in [29], where they came up with the first nontrivial conditional inequality. Later this technique was employed by F. Matúš in [18] to prove another conditional inequality. In this paper we use a similar argument to prove one more inequality. Each of these three inequalities involves four random variables. We denote these inequalities (I1), (I2), and (I3) respectively; their validity constitutes Theorem 1 in Section II. In this paper we prove only the new inequality (I3) and refer the reader to [29] and [18] for the proofs of (I1) and (I2) respectively.

Three other conditional inequalities are proven implicitly by Matúš in [20]. A different technique is used in this proof; it is based on approximation of a conditional inequality with some infinite series of unconditional ones (see footnote 1). We formulate this result of Matúš in Theorem 2 (and give an explicit proof). In what follows we denote these inequalities (I4-I6). Each of these three inequalities involves five random variables. Identifying some variables in these inequalities, we obtain as a corollary two nontrivial conditional inequalities (I4'-I5') with four variables, which are of independent interest (see below).

The main advantage of the technique in Theorem 2 is that (I4-I6) and (I4'-I5') can be proven not only for entropic but also for almost entropic points.

2. Essentially conditional inequalities. In Section III we introduce the central notion of this paper: we define essentially conditional linear information inequalities (inequalities which cannot be obtained as a restriction of any unconditional linear inequality for entropies). In Theorem 3 we prove the main technical result of the paper: inequalities (I1-I6) and (I4'-I5') are essentially conditional.

As we mentioned above, (I4-I6) hold for almost entropic points. This is not the case for (I1) and (I3). In Theorem 5 of Section IV-A we prove that (I1) and (I3) are valid for entropic points but not for almost entropic points. We actually prove that there exist almost entropic points that satisfy all unconditional linear inequalities for entropy but do not satisfy conditional inequalities (I1) and (I3).

For (I2) the question remains open: we do not know whether it is true for almost entropic points.

3. Geometric meaning of essentially conditional inequalities. In Section IV-B we discuss the geometric meaning of inequalities that hold for almost entropic points. We explain that each of the inequalities (I4-I6) implies the Matúš theorem: the cone of all (unconditional) linear inequalities for entropies is not polyhedral if $n \ge 4$ random variables are involved. Basically, our proof is a more geometric explanation of the ideas that appear in the original argument by Matúš in [20].

4. Conditional inequalities for Kolmogorov complexity. In Section V we discuss essentially conditional inequalities for Kolmogorov complexity. In Theorem 8 we prove that some counterparts of (I4-I6) are valid for Kolmogorov complexity; in Theorem 9 we show that natural counterparts of (I1) and (I3) do not hold for Kolmogorov complexity.

Proofs of some technical lemmas are deferred to the Appendix.

II. Nontrivial Conditional Inequalities 

A. The first approach: Zhang-Yeung's method

Most known proofs of non-Shannon-type information inequalities are based on the method proposed by Z. Zhang and R.W. Yeung in [29] and [30], see a general exposition in [8], [28]. The basic ideas of this method can be explained as follows. Suppose we have $n$ jointly distributed random variables. We take the marginal distributions for the given random variables and use them to assemble one or two new distributions. The new distributions may involve more than $n$ random variables (we can take several "copies" of the same components from the original distribution). Then we apply some known linear information inequality to the constructed distributions, or use the non-negativity of the Kullback-Leibler divergence. We express the resulting inequality in terms of entropies of the original distribution, and get a new inequality.

Footnote 1 (to the description of the second technique in Section I-B): The unconditional inequalities involved in the proof are actually derived themselves by Zhang-Yeung's method, so the first technique is also implicitly used here.

This type of argument was used in [10], [18], [29] to prove some conditional information inequalities. We combine these results in the following theorem.

Theorem 1: For every distribution $(a, b, c, d)$

(I1) if $I(a:b|c) = I(a:b) = 0$, then $I(c:d) \le I(c:d|a) + I(c:d|b)$,

(I2) if $I(a:b|c) = I(b:d|c) = 0$, then $\Box_{ab,cd} \ge 0$,

(I3) if $I(a:b|c) = H(c|a,b) = 0$, then $\Box_{ab,cd} \ge 0$.

Remark 2: Inequality (I1) can also be written as $\Box_{ab,cd} \ge 0$, since $\Box_{ab,cd} = I(c:d|a) + I(c:d|b) - I(c:d)$ under the restriction $I(a:b) = 0$.

Inequality (I1) was proven by Z. Zhang and R.W. Yeung in [29], (I2) was proven by F. Matúš in [18], and (I3) was originally obtained in [10]. In this paper we prove only (I3).

Proof of (I3): The argument consists of two steps: 'enforcing conditional independence' and 'elimination of conditional entropy'. Consider a joint distribution of random variables $a, b, c, d$. The first trick of the argument is a suitable transformation of this distribution. We keep the same distribution on the triples $(a, c, d)$ and $(b, c, d)$ but make $a$ and $b$ independent conditional on $(c, d)$. Intuitively it means that we first choose at random (using the old distribution) values of $c$ and $d$; then given fixed values of $c, d$ we independently choose at random $a$ and $b$ (the conditional distributions of $a$ given $(c, d)$ and $b$ given $(c, d)$ are the same as in the original distribution). More formally, if $p(a, b, c, d)$ is the original distribution, then the new distribution $p'$ is defined as

$$p'(a, b, c, d) = \frac{p(a, c, d) \cdot p(b, c, d)}{p(c, d)}$$

(for all values $(a, b, c, d)$ of the four random variables). With some abuse of notation we denote the new random variables by $a', b', c, d$. From the construction ($a'$ and $b'$ are independent given $c, d$) it follows that

$$H(a', b', c, d) = H(c, d) + H(a'|c, d) + H(b'|c, d).$$

Since $(a', c, d)$ and $(b', c, d)$ have exactly the same distributions as the original $(a, c, d)$ and $(b, c, d)$ respectively, we have

$$H(a', b', c, d) = H(c, d) + H(a|c, d) + H(b|c, d).$$

The same entropy can be bounded in another way:

$$H(a', b', c, d) \le H(d) + H(a'|d) + H(b'|d) + H(c|a', b').$$

Notice that the conditional entropies $H(a'|d)$ and $H(b'|d)$ are equal to $H(a|d)$ and $H(b|d)$ respectively (we again use the fact that $a', d$ and $b', d$ have the same distributions as $a, d$ and $b, d$ in the original distribution). Thus, we get

$$H(c, d) + H(a|c, d) + H(b|c, d) \le H(d) + H(a|d) + H(b|d) + H(c|a', b').$$

It remains to estimate the value $H(c|a', b')$. We will show that it is zero (and this is the second trick used in the argument).

Here we will use the two conditions of the theorem. We say that some values $a, c$ ($b, c$ or $a, b$ respectively) are compatible if in the original distribution these values can appear together, i.e., $p(a, c) > 0$ ($p(b, c) > 0$ or $p(a, b) > 0$ respectively). Since $a$ and $b$ are independent given $c$, if some values $a$ and $b$ are compatible with the same value $c$, then these $a$ and $b$ are compatible with each other.

In the new distribution $(a', b', c, d)$, values of $a'$ and $b'$ are compatible with each other only if they are compatible with some value of $c$; hence, these values must also be compatible with each other for the original distribution $(a, b)$. Further, since $H(c|a, b) = 0$, for each pair of compatible values of $a, b$ there exists only one value of $c$. Thus, for a random pair of values of $(a', b')$ with probability one there exists only one value of $c$. In a word, in the new distribution $H(c|a', b') = 0$.

Summarizing our arguments, we get

$$H(c, d) + H(a|c, d) + H(b|c, d) \le H(d) + H(a|d) + H(b|d),$$

which, under the conditions $I(a:b|c) = H(c|a, b) = 0$ (they imply $H(c) = H(c|a) + H(c|b) + I(a:b)$), is equivalent to

$$I(c:d) \le I(c:d|a) + I(c:d|b) + I(a:b). \qquad ■$$
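
The two steps of this proof can be traced on a small example. The following Python sketch builds the transformed distribution $p'$ and checks the properties used above. The function name `enforce_cond_independence` and the toy distribution (chosen so that $I(a:b|c) = 0$ and $H(c|a,b) = 0$) are assumptions made for illustration only.

```python
from math import log2
from collections import defaultdict

def marginal(p, idx):
    """Marginal distribution of the variables at positions idx."""
    m = defaultdict(float)
    for outcome, prob in p.items():
        m[tuple(outcome[i] for i in idx)] += prob
    return m

def H(p, idx):
    """Shannon entropy (in bits) of the variables at positions idx."""
    return -sum(v * log2(v) for v in marginal(p, idx).values() if v > 0)

def enforce_cond_independence(p):
    """Return p'(a,b,c,d) = p(a,c,d) * p(b,c,d) / p(c,d); outcomes are (a, b, c, d)."""
    p_acd = marginal(p, (0, 2, 3))
    p_bcd = marginal(p, (1, 2, 3))
    p_cd = marginal(p, (2, 3))
    new = {}
    for (a, c, d), pa in p_acd.items():
        for (b, c2, d2), pb in p_bcd.items():
            if (c2, d2) == (c, d):
                new[(a, b, c, d)] = pa * pb / p_cd[(c, d)]
    return new

# Hypothetical distribution with I(a:b|c) = 0 and H(c|a,b) = 0:
# c = 0: a, b are independent fair bits and d = a XOR b;
# c = 1: (a, b) = (0, 2); c = 2: (a, b) = (2, 0); d = 0 in both cases.
p = {(a, b, 0, a ^ b): 1 / 8 for a in (0, 1) for b in (0, 1)}
p[(0, 2, 1, 0)] = 1 / 4
p[(2, 0, 2, 0)] = 1 / 4

q = enforce_cond_independence(p)
A, B, C, D = (0,), (1,), (2,), (3,)
h_c_given_ab = H(q, A + B + C) - H(q, A + B)            # H(c|a',b') in p'
lhs = H(p, C + D) + (H(p, A + C + D) - H(p, C + D)) + (H(p, B + C + D) - H(p, C + D))
rhs = H(p, D) + (H(p, A + D) - H(p, D)) + (H(p, B + D) - H(p, D))
print(h_c_given_ab, H(q, A + B + C + D), lhs, rhs)
```

In this example $p'$ differs from $p$ (the XOR correlation between $a$ and $b$ is destroyed), while the marginals on $(a, c, d)$ and $(b, c, d)$ are preserved and $c$ remains a function of $(a', b')$.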

B. The second approach: a conditional inequality as a limit 
of unconditional ones 

Another group of conditional inequalities was proven implicitly in [20]. In this argument a conditional inequality is obtained as a limit of some infinite family of unconditional inequalities.

Theorem 2 (F. Matúš): For every distribution $(a, b, c, d, e)$

(I4) if $I(a:d|c) = I(a:c|d) = 0$, then $\Box_{ab,cd} + I(a:c|e) + I(a:e|c) \ge 0$,

(I5) if $I(b:c|d) = I(c:d|b) = 0$, then $\Box_{ab,cd} + I(b:c|e) + I(c:e|b) \ge 0$,

(I6) if $I(b:c|d) = I(c:d|b) = 0$, then $\Box_{ab,cd} + I(c:d|e) + I(c:e|d) \ge 0$.

These inequalities hold not only for entropic but also for almost entropic points.

Proof: The following series of unconditional inequalities was proven in [20] for all $k = 1, 2, \ldots$

$$\begin{aligned}
\text{(i)}\quad & \Box_{ab,cd} + I(a:c|e) + I(a:e|c) + \frac{1}{k} I(c:e|a) + \frac{k-1}{2}\big(I(a:d|c) + I(a:c|d)\big) \ge 0,\\
\text{(ii)}\quad & \Box_{ab,cd} + I(b:c|e) + I(c:e|b) + \frac{1}{k} I(b:e|c) + \frac{k-1}{2}\big(I(b:c|d) + I(c:d|b)\big) \ge 0,\\
\text{(iii)}\quad & \Box_{ab,cd} + I(c:d|e) + I(c:e|d) + \frac{1}{k} I(d:e|c) + \frac{k-1}{2}\big(I(b:c|d) + I(c:d|b)\big) \ge 0
\end{aligned}$$

(inequalities (i), (ii), and (iii) correspond to the three claims of Theorem 2 in [20], for $\Box$ equal to $\Box_{14,23}$, $\Box_{13,24}$, and $\Box_{12,34}$ respectively).

The constraints in (I4), (I5) and (I6) imply that the terms with the coefficient $\frac{k-1}{2}$ in (i), (ii) and (iii) respectively are equal to zero. The terms with the coefficient $\frac{1}{k}$ vanish as $k$ tends to infinity. So in the limit we obtain from (i)-(iii) the required inequalities (I4-I6).

Note that linear inequalities (i), (ii), (iii) hold for all points in the cone of almost entropic points. Hence, the limits (I4-I6) are also valid for almost entropic points. ■

We can restrict this theorem to 4-variable distributions and get simpler and more symmetric inequalities:

Corollary 1: For every distribution $(a, b, c, d)$

(I4') if $I(a:d|c) = I(a:c|d) = 0$, then $\Box_{ab,cd} \ge 0$,

(I5') if $I(b:c|d) = I(c:d|b) = 0$, then $\Box_{ab,cd} \ge 0$.

These conditional inequalities are also valid for almost entropic points.

Proof: Inequality (I4') is the restriction of (I4) to the case $e = d$. Inequality (I5') can be obtained as a restriction of (I5) or (I6) for $e = d$ and $e = b$ respectively. ■



C. A direct proof of (I4')

In this section we show that conditional inequality (I4') from Corollary 1 can be proven by a simpler direct argument. In what follows we deduce (I4') (for entropic points) from two elementary lemmas.

Lemma 1 (Double Markov Property): Assuming that

$$I(x:z|y) = I(y:z|x) = 0,$$

there exists a random variable $w$ (defined on the same probability space as $x, y, z$) such that

• $H(w|x) = 0$,
• $H(w|y) = 0$,
• $I(z:x, y|w) = 0$.

This lemma is an exercise in [7]; see also Lemma 4 in [15].
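
The paper does not spell out how $w$ is built. One standard construction, assumed here for illustration, takes $w$ to be the connected component of $x$ (equivalently, of $y$) in the bipartite graph whose edges are the pairs $(x, y)$ of positive probability. The sketch below implements this with a small union-find; the function name `common_function` and the toy distribution are hypothetical.

```python
from math import log2
from collections import defaultdict

def H(p, idx):
    """Shannon entropy (in bits) of the variables at positions idx."""
    marginal = defaultdict(float)
    for outcome, prob in p.items():
        marginal[tuple(outcome[i] for i in idx)] += prob
    return -sum(v * log2(v) for v in marginal.values() if v > 0)

def common_function(p):
    """For outcomes (x, y, z), return the map x -> connected component of x in the
    bipartite graph whose edges are the pairs (x, y) with positive probability."""
    parent = {}

    def find(u):
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving
            u = parent[u]
        return u

    for (x, y, _z), prob in p.items():
        if prob > 0:
            parent[find(("x", x))] = find(("y", y))
    return lambda x: find(("x", x))

# Hypothetical example: x = (s, x0) and y = (s, y0) share a fair bit s,
# and z is a noisy copy of s, so that I(x:z|y) = I(y:z|x) = 0.
p = {}
for s in (0, 1):
    for x0, y0 in ((0, 0), (0, 1), (1, 1)):
        for noise, p_noise in ((0, 0.9), (1, 0.1)):
            p[((s, x0), (s, y0), s ^ noise)] = 0.5 * p_noise / 3

w = common_function(p)
pw = {(x, y, z, w(x)): prob for (x, y, z), prob in p.items()}   # outcomes (x, y, z, w)
X, Y, Z, W = (0,), (1,), (2,), (3,)
h_w_x = H(pw, W + X) - H(pw, X)
h_w_y = H(pw, W + Y) - H(pw, Y)
i_z_xy_w = H(pw, Z + W) + H(pw, X + Y + W) - H(pw, X + Y + Z + W) - H(pw, W)
print(h_w_x, h_w_y, i_z_xy_w, H(pw, W))
```

In this example the component label recovers the shared bit $s$, so $w$ is a function of $x$ and of $y$, and $z$ is independent of $(x, y)$ given $w$.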

Remark 3: If $I(x:z|y) = I(y:z|x) = 0$ and $I(z:x, y|w) = 0$, then $I(x:y|w) = I(x:y|w, z)$. Thus, we get from Lemma 1 a random variable $w$ that functionally depends on $x$ and on $y$, and $I(x:y|w) = I(x:y|w, z)$.

Lemma 2: Assume there exists a random variable $w$ such that

• $H(w|x) = 0$,
• $H(w|y) = 0$,
• $I(x:y|w) = I(x:y|w, z)$.

Then $\Box_{vz,xy} \ge 0$.

Proof of Lemma 2: Since $H(w|x) = H(w|y) = 0$, we get

$$I(x:y) = H(w) + I(x:y|w),$$

where the value $I(x:y|w)$ can be substituted by $I(x:y|w, z)$. Further, for any triple of random variables $v, w, z$ we have

$$H(w) \le H(w|v) + H(w|z) + I(z:v).$$

Then, we use again the assumption $H(w|x) = H(w|y) = 0$ and get

$$H(w|z) + I(x:y|w, z) = I(x:y|z)$$

and

$$H(w|v) \le I(x:y|v).$$

The sum of these relations results in $\Box_{vz,xy} \ge 0$. ■

Inequality (I4') follows immediately from these two lemmas (substitute $z, v, x, y$ with $a, b, c, d$ respectively; recall that $\Box_{vz,xy}$ is symmetric in $v$ and $z$).

Remark 4: This argument employs only Shannon-type inequalities. The single step which cannot be reduced to a combination of basic inequalities is the lemma about the Double Markov property: the principal trick of this proof is the construction of a new random variable $w$ in Lemma 1.

Remark 5: The proof presented in this section is simpler, but the more involved argument explained in Section II-B gives much more information about (I4'). Indeed, the argument based on approximation with unconditional linear inequalities implies that (I4') holds not only for entropic but also for almost entropic points.

III. Why Inequalities Are Essentially Conditional

The nature of inequalities (I1-I6) is quite different from the inequalities discussed in Examples 1-3. The peculiarity of (I1-I6) is that they cannot be deduced directly from any unconditional inequality. In general, we say that an inequality with linear constraints is essentially conditional if it cannot be extended to any unconditional inequality where conditions are added with "Lagrange multipliers". Let us define this class of inequalities more precisely.

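Definition 1: Let $\alpha(\mathcal{X})$ and $\beta_1(\mathcal{X}), \ldots, \beta_m(\mathcal{X})$ be linear functions on entropies of $\mathcal{X} = (x_1, \ldots, x_n)$

$$\begin{aligned}
\alpha(\mathcal{X}) &= \sum_{X \subseteq \mathcal{X}} \alpha_X H(X),\\
\beta_i(\mathcal{X}) &= \sum_{X \subseteq \mathcal{X}} \beta_{i,X} H(X), \quad i = 1, \ldots, m
\end{aligned}$$

such that the implication

$$(\beta_i(\mathcal{X}) = 0 \text{ for all } i = 1, \ldots, m) \;\Rightarrow\; \alpha(\mathcal{X}) \ge 0$$

holds for all distributions $\mathcal{X}$. We call this implication a conditional linear information inequality.

This conditional inequality is said to be essentially conditional if for all $(\kappa_i)_{1 \le i \le m}$ the inequality

$$\alpha(\mathcal{X}) + \sum_{i=1}^{m} \kappa_i \beta_i(\mathcal{X}) \ge 0$$

does not hold (for some distribution).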

Examples 1-3 in the Introduction and statements (I1-I6) are instances of conditional information inequalities.

Theorem 3: Inequalities (I1-I6) and (I4'-I5') are essentially conditional.

Remark 6: From the fact that (I4'-I5') are essentially conditional, it follows immediately that (I4-I6) are also essentially conditional. Thus, it remains to prove the theorem for (I1-I3) and (I4'-I5').
A. Proof of Theorem 3 based on binary counterexamples

Claim 1. For any $\kappa$ the inequality

$$I(c:d) \le I(c:d|a) + I(c:d|b) + \kappa(I(a:b|c) + I(a:b)) \tag{3}$$

does not hold for some distributions $(a, b, c, d)$, i.e., (I1) is essentially conditional.

Proof: For all $\varepsilon \in [0, 1]$, consider the following joint distribution of binary variables $(a, b, c, d)$:

$$\begin{aligned}
\mathrm{Prob}[a = 0,\ b = 0,\ c = 0,\ d = 1] &= (1 - \varepsilon)/4,\\
\mathrm{Prob}[a = 0,\ b = 1,\ c = 0,\ d = 0] &= (1 - \varepsilon)/4,\\
\mathrm{Prob}[a = 1,\ b = 0,\ c = 0,\ d = 1] &= (1 - \varepsilon)/4,\\
\mathrm{Prob}[a = 1,\ b = 1,\ c = 0,\ d = 1] &= (1 - \varepsilon)/4,\\
\mathrm{Prob}[a = 1,\ b = 0,\ c = 1,\ d = 1] &= \varepsilon.
\end{aligned}$$

For each value of $a$ and for each value of $b$, the value of at least one of the variables $c, d$ is uniquely determined: if $a = 0$ then $c = 0$; if $a = 1$ then $d = 1$; if $b = 0$ then $d = 1$; and if $b = 1$ then $c = 0$. Hence, $I(c:d|a) = I(c:d|b) = 0$. Also it is easy to see that $I(a:b|c) = 0$. Thus, if (3) is true, then $I(c:d) \le \kappa I(a:b)$.

Denote the left-hand and right-hand sides of this inequality by $L(\varepsilon) = I(c:d)$ and $R(\varepsilon) = \kappa I(a:b)$. Both functions $L(\varepsilon)$ and $R(\varepsilon)$ are continuous, and $L(0) = R(0) = 0$ (for $\varepsilon = 0$ both sides of the inequality are equal to 0). However the asymptotics of $L(\varepsilon)$ and $R(\varepsilon)$ as $\varepsilon \to 0$ are different: it is not hard to check that $L(\varepsilon) = \Theta(\varepsilon)$, but $R(\varepsilon) = O(\varepsilon^2)$. From (3) it follows that $\Theta(\varepsilon) \le O(\varepsilon^2)$, which is a contradiction. ■
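
The different asymptotics can be observed numerically. The sketch below evaluates the distribution from the proof of Claim 1 for several values of $\varepsilon$; the helper names and the chosen values of $\varepsilon$ are assumptions made for illustration.

```python
from math import log2
from collections import defaultdict

def H(p, idx):
    """Shannon entropy (in bits) of the variables at positions idx."""
    marginal = defaultdict(float)
    for outcome, prob in p.items():
        marginal[tuple(outcome[i] for i in idx)] += prob
    return -sum(v * log2(v) for v in marginal.values() if v > 0)

def I(p, x, y, z=()):
    """I(x:y|z) = H(x,z) + H(y,z) - H(x,y,z) - H(z)."""
    return H(p, x + z) + H(p, y + z) - H(p, x + y + z) - H(p, z)

def claim1(eps):
    """The five-outcome distribution of (a, b, c, d) from the proof of Claim 1."""
    q = (1 - eps) / 4
    return {(0, 0, 0, 1): q, (0, 1, 0, 0): q, (1, 0, 0, 1): q,
            (1, 1, 0, 1): q, (1, 0, 1, 1): eps}

A, B, C, D = (0,), (1,), (2,), (3,)
ratios = []
max_zero_term = 0.0
for eps in (1e-1, 1e-2, 1e-3):
    p = claim1(eps)
    # the three terms that vanish identically for this distribution
    zero_terms = (I(p, C, D, A), I(p, C, D, B), I(p, A, B, C))
    max_zero_term = max(max_zero_term, *(abs(t) for t in zero_terms))
    ratio = I(p, C, D) / I(p, A, B)   # grows without bound as eps -> 0
    ratios.append(ratio)
    print(eps, ratio)
```

Since the ratio $I(c:d)/I(a:b)$ is unbounded, no fixed $\kappa$ can make (3) hold for all $\varepsilon$.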

Claim 2. For any $\kappa$ the inequality

$$I(c:d) \le I(c:d|a) + I(c:d|b) + I(a:b) + \kappa(I(a:b|c) + I(b:d|c)) \tag{4}$$

does not hold for some distributions $(a, b, c, d)$, i.e., (I2) is essentially conditional.

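Proof: For the sake of contradiction we consider the following joint distribution of binary variables $(a, b, c, d)$ for every value of $\varepsilon \in [0, \frac{1}{3}]$:

$$\begin{aligned}
\mathrm{Prob}[a = 0,\ b = 0,\ c = 0,\ d = 0] &= 3\varepsilon,\\
\mathrm{Prob}[a = 1,\ b = 1,\ c = 0,\ d = 0] &= 1/3 - \varepsilon,\\
\mathrm{Prob}[a = 1,\ b = 0,\ c = 1,\ d = 0] &= 1/3 - \varepsilon,\\
\mathrm{Prob}[a = 0,\ b = 1,\ c = 0,\ d = 1] &= 1/3 - \varepsilon.
\end{aligned}$$

We substitute this distribution in (4) and obtain

$$I_0 + O(\varepsilon) \le I_0 + 3\varepsilon \log \varepsilon + O(\varepsilon) + O(\kappa\varepsilon),$$

where $I_0$ is the mutual information between $c$ and $d$ for $\varepsilon = 0$ (which is equal to the mutual information between $a$ and $b$ for $\varepsilon = 0$). Since the negative term $3\varepsilon \log \varepsilon$ dominates all terms of order $\varepsilon$, we get a contradiction as $\varepsilon \to 0$. ■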

Claim 3. For any $\kappa$ the inequality

$$I(c:d) \le I(c:d|a) + I(c:d|b) + I(a:b) + \kappa(I(a:b|c) + H(c|a,b)) \tag{5}$$

does not hold for some distributions $(a, b, c, d)$, i.e., (I3) is essentially conditional.

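Proof: For every value of $\varepsilon \in [0, \frac{1}{2}]$ we consider the following joint distribution of binary variables $(a, b, c, d)$:

$$\begin{aligned}
\mathrm{Prob}[a = 1,\ b = 1,\ c = 0,\ d = 0] &= 1/2 - \varepsilon,\\
\mathrm{Prob}[a = 0,\ b = 1,\ c = 1,\ d = 0] &= \varepsilon,\\
\mathrm{Prob}[a = 1,\ b = 0,\ c = 1,\ d = 0] &= \varepsilon,\\
\mathrm{Prob}[a = 0,\ b = 0,\ c = 1,\ d = 1] &= 1/2 - \varepsilon.
\end{aligned}$$

First, it is not hard to check that $I(c:d|a) = I(c:d|b) = H(c|a,b) = 0$ for every $\varepsilon$. Second,

$$\begin{aligned}
I(a:b) &= 1 + (2 - 2/\ln 2)\varepsilon + 2\varepsilon \log \varepsilon + O(\varepsilon^2),\\
I(c:d) &= 1 + (4 - 2/\ln 2)\varepsilon + 2\varepsilon \log \varepsilon + O(\varepsilon^2),
\end{aligned}$$

so $I(a:b)$ and $I(c:d)$ both tend to 1 as $\varepsilon \to 0$, but their asymptotics are different. Similarly,

$$I(a:b|c) = O(\varepsilon^2).$$

It follows from (5) that

$$2\varepsilon + O(\varepsilon^2) \le O(\varepsilon^2) + O(\kappa\varepsilon^2),$$

and with any $\kappa$ we get a contradiction for small enough $\varepsilon$. ■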
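
As a numerical sanity check of this argument, the sketch below evaluates both sides of (5) for the distribution of Claim 3. The multiplier `kappa = 100` and the value `eps = 1e-3` are arbitrary illustrative choices, and the helper names are assumptions.

```python
from math import log2
from collections import defaultdict

def H(p, idx):
    """Shannon entropy (in bits) of the variables at positions idx."""
    marginal = defaultdict(float)
    for outcome, prob in p.items():
        marginal[tuple(outcome[i] for i in idx)] += prob
    return -sum(v * log2(v) for v in marginal.values() if v > 0)

def I(p, x, y, z=()):
    """I(x:y|z) = H(x,z) + H(y,z) - H(x,y,z) - H(z)."""
    return H(p, x + z) + H(p, y + z) - H(p, x + y + z) - H(p, z)

def claim3(eps):
    """The four-outcome distribution of (a, b, c, d) from the proof of Claim 3."""
    return {(1, 1, 0, 0): 0.5 - eps, (0, 1, 1, 0): eps,
            (1, 0, 1, 0): eps, (0, 0, 1, 1): 0.5 - eps}

A, B, C, D = (0,), (1,), (2,), (3,)
kappa = 100.0   # hypothetical multiplier
eps = 1e-3      # hypothetical small parameter
p = claim3(eps)

h_c_given_ab = H(p, A + B + C) - H(p, A + B)
lhs = I(p, C, D)
rhs = (I(p, C, D, A) + I(p, C, D, B) + I(p, A, B)
       + kappa * (I(p, A, B, C) + h_c_given_ab))
gap_per_eps = (I(p, C, D) - I(p, A, B)) / eps   # should be close to 2
print(lhs - rhs, gap_per_eps, I(p, A, B, C) / eps**2)
```

A positive value of `lhs - rhs` means that (5) fails for this $\kappa$ and $\varepsilon$. Larger multipliers require smaller $\varepsilon$, in line with the asymptotics in the proof.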

Claim 4. For any $\kappa$ the inequality

$$\Box_{ab,cd} + \kappa[I(a:c|d) + I(a:d|c)] \ge 0 \tag{6}$$

does not hold for some distributions, i.e., (I4') is essentially conditional.

Proof: For all small enough $\varepsilon \ge 0$, consider the following joint distribution of binary variables $(a, b, c, d)$:

Prob[a = 0, b = 0, c = 0, d = 0]
Prob[a = 1, b = 1, c = 0, d = 0]
Prob[a = 0, b = 1, c = 1, d = 0]
Prob[a = 1, b = 1, c = 1, d = 0]
Prob[a = 0, b = 0, c = 0, d = 1]
Prob[a = 1, b = 0, c = 0, d = 1]

For this distribution, the left-hand side of inequality (6) rewrites to

$$\Box_{ab,cd} + \kappa[I(a:c|d) + I(a:d|c)] = -\frac{2}{\ln 2}\varepsilon^2 + O(\kappa\varepsilon^3),$$

which is negative for $\varepsilon$ small enough. ■

Claim 5. For any $\kappa$ the inequality

$$\Box_{ab,cd} + \kappa[I(b:c|d) + I(c:d|b)] \ge 0 \tag{7}$$

does not hold for some distributions, i.e., (I5') is essentially conditional.

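Proof: For all $\varepsilon \in [0, \frac{1}{2}]$, consider the following joint distribution for $(a, b, c, d)$:

$$\begin{aligned}
\mathrm{Prob}[a = 0,\ b = 0,\ c = 0,\ d = 0] &= 1/2 - \varepsilon,\\
\mathrm{Prob}[a = 0,\ b = 1,\ c = 0,\ d = 1] &= 1/2 - \varepsilon,\\
\mathrm{Prob}[a = 1,\ b = 0,\ c = 1,\ d = 0] &= \varepsilon,\\
\mathrm{Prob}[a = 1,\ b = 1,\ c = 0,\ d = 0] &= \varepsilon.
\end{aligned}$$

For this distribution we have $I(c:d|a) = I(c:d|b) = I(a:b) = 0$, and

$$I(c:d) = \varepsilon + O(\varepsilon^2) \quad \text{and} \quad I(b:c|d) = O(\varepsilon^2).$$

Therefore, the left-hand side of inequality (7) rewrites to

$$\Box_{ab,cd} + \kappa[I(b:c|d) + I(c:d|b)] = -\varepsilon + O(\kappa\varepsilon^2),$$

which is negative for $\varepsilon$ small enough. ■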
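
The computation in Claim 5 can be reproduced numerically. The sketch below assumes the probabilities $1/2 - \varepsilon$, $1/2 - \varepsilon$, $\varepsilon$, $\varepsilon$ for the four outcomes (a reading that is consistent with normalization and with the stated values of $I(c:d)$ and $I(b:c|d)$); the values `kappa = 100` and `eps = 1e-3` and the helper names are illustrative assumptions.

```python
from math import log2
from collections import defaultdict

def H(p, idx):
    """Shannon entropy (in bits) of the variables at positions idx."""
    marginal = defaultdict(float)
    for outcome, prob in p.items():
        marginal[tuple(outcome[i] for i in idx)] += prob
    return -sum(v * log2(v) for v in marginal.values() if v > 0)

def I(p, x, y, z=()):
    """I(x:y|z) = H(x,z) + H(y,z) - H(x,y,z) - H(z)."""
    return H(p, x + z) + H(p, y + z) - H(p, x + y + z) - H(p, z)

def claim5(eps):
    """Assumed distribution of (a, b, c, d) for Claim 5."""
    return {(0, 0, 0, 0): 0.5 - eps, (0, 1, 0, 1): 0.5 - eps,
            (1, 0, 1, 0): eps, (1, 1, 0, 0): eps}

A, B, C, D = (0,), (1,), (2,), (3,)
kappa = 100.0   # hypothetical multiplier
eps = 1e-3      # hypothetical small parameter
p = claim5(eps)

box = I(p, C, D, A) + I(p, C, D, B) + I(p, A, B) - I(p, C, D)
lhs7 = box + kappa * (I(p, B, C, D) + I(p, C, D, B))   # left-hand side of (7)
print(box, lhs7, I(p, C, D) / eps, I(p, B, C, D) / eps**2)
```

A negative value of `lhs7` exhibits a violation of (7) for this particular multiplier.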
B. Geometric counterexample and another proof of Theorem 3 for (I1) and (I3)

Consider a random quadruple $(a, b, c, d)_q$ of geometric objects on the affine plane over a finite field $\mathbb{F}_q$:

• First, choose a random non-vertical line $c$ defined by the equation $y = c_0 + c_1 x$ (the coefficients $c_0$ and $c_1$ are independent random elements of the field);

• then pick independently and uniformly two points $a$ and $b$ on the line $c$ (technically, $a = (a_1, a_2)$ and $b = (b_1, b_2)$, where $a_i$ and $b_i$ are elements of $\mathbb{F}_q$; since the points are chosen independently, they coincide with each other with probability $1/q$);

• pick uniformly at random a parabola $d$ in the set of all non-degenerate parabolas $y = d_0 + d_1 x + d_2 x^2$ (where $d_0, d_1, d_2 \in \mathbb{F}_q$, $d_2 \ne 0$) that intersect $c$ at points $a$ and $b$ (if $a = b$ then $c$ is the tangent line to $d$ at $b$).

A typical quadruple is shown in Figure 1.

Fig. 1. A typical configuration of random objects (a,b,c,d).

By construction, if the line $c$ is known, then the points $a$ and $b$ are independent, i.e.,

$$I(a:b|c) = 0.$$

Similarly, $a$ and $b$ are independent when $d$ is known, and also $c$ and $d$ are independent given $a$ and given $b$ (when an intersection point is given, the line does not give more information about the parabola), i.e.,

$$I(a:b|d) = I(c:d|a) = I(c:d|b) = 0.$$

The mutual information between $c$ and $d$ is approximately 1 bit, because a randomly chosen line and parabola intersect iff the discriminant of the corresponding equation is a quadratic residue, which happens almost half of the time. A more accurate calculation gives $I(c:d) = 1 - \frac{1}{q}$.

When $a$ and $b$ are known and $a \neq b$, then $c$ is uniquely defined (the only line incident with both points). If $a = b$ (which happens with probability $1/q$) we need $\log q$ bits to specify $c$. Hence,

$$H(c|a,b) = \frac{\log q}{q}.$$

To estimate $I(a:b)$ we note that $a$ is uniformly distributed on $\mathbb{F}_q^2$. When $b = (b_1, b_2)$ is known, then with probability $1/q$ we have $a = b$, and with probability $1 - 1/q$ the value of $a$ is uniformly chosen among the $q(q-1)$ values $(a_1, a_2)$ such that $a_1 \neq b_1$. Hence, $I(a:b) = \frac{\log q}{q}$.

Now we compute both sides of the inequality

$$I(c:d) \le I(c:d|a) + I(c:d|b) + \kappa_1 I(a:b) + \kappa_2 I(a:b|c) + \kappa_3 H(c|a,b)$$

and get

$$1 - \frac{1}{q} \le \kappa_1 \frac{\log q}{q} + \kappa_3 \frac{\log q}{q}.$$

This leads to a contradiction for large $q$. It follows that (I1) and (I3) are essentially conditional.
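For a small prime $q$ the geometric example can be enumerated exhaustively and its information quantities computed exactly. The sketch below assumes $q$ is prime (so that arithmetic modulo $q$ is a field) and parametrizes the parabola as $d(x) = c(x) + d_2 (x - a_1)(x - b_1)$ with $d_2 \neq 0$, which is one way to encode "the parabola meets $c$ exactly at $a$ and $b$"; the function names are illustrative.

```python
from math import log2

def entropy(dist, idx):
    """Shannon entropy (in bits) of the marginal of dist on the coordinates idx."""
    marg = {}
    for point, p in dist.items():
        key = tuple(point[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def cmi(dist, x, y, z=()):
    """Conditional mutual information I(x:y|z) = H(x,z) + H(y,z) - H(x,y,z) - H(z)."""
    x, y, z = tuple(x), tuple(y), tuple(z)
    return (entropy(dist, x + z) + entropy(dist, y + z)
            - entropy(dist, x + y + z) - entropy(dist, z))

def geometric_quadruple(q):
    """Exact joint distribution of (a, b, c, d)_q for a prime q; all outcomes are equiprobable."""
    outcomes = []
    for c0 in range(q):
        for c1 in range(q):
            for a1 in range(q):
                for b1 in range(q):
                    for d2 in range(1, q):
                        a = (a1, (c0 + c1 * a1) % q)
                        b = (b1, (c0 + c1 * b1) % q)
                        # Coefficients of d(x) = c(x) + d2 * (x - a1) * (x - b1).
                        d = ((c0 + d2 * a1 * b1) % q, (c1 - d2 * (a1 + b1)) % q, d2)
                        outcomes.append((a, b, (c0, c1), d))
    p = 1.0 / len(outcomes)
    return {o: p for o in outcomes}

A, B, C, D = 0, 1, 2, 3
for q in (3, 5, 7):
    dist = geometric_quadruple(q)
    h_c_given_ab = entropy(dist, (A, B, C)) - entropy(dist, (A, B))
    print(q, cmi(dist, [C], [D]), cmi(dist, [A], [B]), h_c_given_ab, log2(q) / q)
```

The printed values of $I(c:d)$ approach 1 as $q$ grows, while $I(a:b)$ and $H(c|a,b)$ both follow $\frac{\log q}{q}$.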

C. Stronger form of Theorem 3 for inequalities (I1) and (I3)

The construction in Section III-B implies something stronger than Theorem 3 for inequalities (I1) and (I3). Indeed, the following proposition follows from (I1) and from (I3):

$$I(a:b|c) = I(a:b|d) = H(c|a,b) = I(c:d|a) = I(c:d|b) = I(a:b) = 0 \;\Rightarrow\; I(c:d) = 0.$$

We claim that this conditional (in)equality is also essentially 
conditional: 

Theorem 4: There is no constant $\kappa$ such that for all random variables $a, b, c, d$

$$I(c:d) \le \kappa[I(c:d|a) + I(c:d|b) + I(a:b) + I(a:b|c) + I(a:b|d) + H(c|a,b)].$$

Proof: For the quadruple $(a,b,c,d)_q$ from the geometric example defined in Section III-B, each term on the right-hand side of the inequality vanishes as $q$ tends to infinity, but the left-hand side does not. ■

Remark 7: The conditional (in)equality above is weaker than (I1) and (I3); so Theorem 4 is somewhat stronger than Theorem 3.

IV. Inequalities for Approximately Entropic Points

Recall that a point in $\mathbb{R}^{2^n-1}$ is called entropic if it represents the entropy profile of some distribution; a point is called almost entropic if it belongs to the closure of the set of entropic points. It is known that for every $n$ the set of almost entropic points is a convex cone in $\mathbb{R}^{2^n-1}$, see [28]. The set of all almost entropic points is exactly the set of points that satisfy all unconditional linear information inequalities. Though some almost entropic points do not correspond to any distribution, we will abuse notation and refer to the coordinates of an almost entropic point as $H(x_i)$, $H(x_i, x_j)$, etc. Moreover, we keep the notation $I(x_i:x_j)$, $I(x_i:x_j|x_k)$, etc. for the corresponding linear combinations of coordinates of these points.
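To make the notion of an entropy profile concrete, the following sketch computes the $2^n - 1$ joint entropies of a finite distribution and evaluates one linear combination of its coordinates. The function name `entropy_profile`, the dictionary representation, and the XOR example are illustrative assumptions, not constructions from the paper.

```python
from itertools import combinations, product
from math import log2

def entropy_profile(dist, n):
    """Map a joint distribution {outcome tuple: probability} to its 2^n - 1 joint entropies.

    The result is a dict keyed by the non-empty index subsets (as sorted tuples).
    """
    profile = {}
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            marg = {}
            for point, p in dist.items():
                key = tuple(point[i] for i in subset)
                marg[key] = marg.get(key, 0.0) + p
            profile[subset] = -sum(p * log2(p) for p in marg.values() if p > 0)
    return profile

# Example: x and y are independent fair bits and z = x XOR y.
xor = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}
h = entropy_profile(xor, 3)

# I(x:y|z) as a linear combination of coordinates of the profile.
i_xy_given_z = h[(0, 2)] + h[(1, 2)] - h[(0, 1, 2)] - h[(2,)]
print(len(h), i_xy_given_z)
```

In this example $x$ and $y$ are independent, yet they become fully dependent once $z$ is known, so $I(x:y|z)$ equals one bit.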
A. Some conditional inequalities fail for almost entropic points

We have seen that (I4-I6) hold for almost entropic points. This is not the case for two other essentially conditional inequalities.

Theorem 5: Inequalities (I1) and (I3) do not hold for almost entropic points.

Proof: The main technical tool used in our proof is Slepian-Wolf coding. We do not need the general version of the classic Slepian-Wolf theorem; we use only its special case (actually this special case makes up the most important part of the general proof of the standard Slepian-Wolf theorem, see Theorem 2 in the original paper [26] and a detailed discussion
in section 5.4.4 of @). 

Lemma 3 (Slepian-Wolf coding): Let $X, Y$ be $N$ independent copies of random variables $x, y$, i.e., $X = (x_1, \dots, x_N)$, $Y = (y_1, \dots, y_N)$, where the pairs $(x_i, y_i)$ are i.i.d. Then there exists $X'$ such that

• $H(X'|X) = 0$,

• $H(X') = H(X|Y) + o(N)$,

• $H(X|X',Y) = o(N)$.
(This Lemma is also a special case of Theorem 3 in 12T].) 
Lemma 3 claims that we can construct a hash of a random variable $X$ which is almost independent of $Y$ and has approximately the entropy of $X$ given $Y$. We will say that $X'$ is the Slepian-Wolf hash of $X$ given $Y$ and write $X' = SW(X|Y)$.

Construction of an almost entropic counterexample for (I1):

1) Start with the distribution $(a,b,c,d)_q$ defined in Section III-B. The value of $q$ is specified in what follows. For this distribution $I(a:b|c) = 0$ but $I(a:b) \neq 0$. So far, the distribution $(a,b,c,d)_q$ does not satisfy the conditions of (I1).

2) Serialize it: define a new quadruple $(A,B,C,D)$ such that each entropy is $N$ times greater. $(A,B,C,D)$ is obtained by sampling $N$ times independently $(a_i, b_i, c_i, d_i)$ according to the distribution $(a,b,c,d)_q$ and letting $A = (a_1, \dots, a_N)$, $B = (b_1, \dots, b_N)$, $C = (c_1, \dots, c_N)$, and $D = (d_1, \dots, d_N)$.

3) Apply the Slepian-Wolf coding lemma (Lemma 3) and define $A' = SW(A|B)$, then replace $A$ by $A'$ in the quadruple.

The entropy profile of $(A',B,C,D)$ cannot differ much from the entropy profile of $(A,B,C,D)$. Indeed, by construction, $H(A'|A) = 0$ and $H(A|A') \le I(A:B) + o(N)$. Hence, the difference between the entropies involving $(A',B,C,D)$ and $(A,B,C,D)$ is at most $I(A:B) + o(N) = O\left(\frac{\log q}{q} N\right)$.

Notice that $I(A':B|C) = 0$ since $A'$ functionally depends on $A$ and in the initial distribution $I(a:b|c) = 0$.

4) Scale down the entropy profile of $(A',B,C,D)$ by a factor of $1/N$. More precisely, if the entropy profile of $(A',B,C,D)$ is some point $h \in \mathbb{R}^{15}$, then for every $\varepsilon > 0$ there exists another distribution $(A'',B'',C'',D'')$ with an entropy profile $h'$ such that $\|h' - \frac{1}{N}h\| < \varepsilon$. This follows from the convexity of the set of almost entropic points (the new distribution can be constructed explicitly, see Theorem 14.5 in [28]). We may assume that $\varepsilon = 1/N$.

5) Let $N$ tend to infinity. The resulting entropy profiles tend to some limit, which is an almost entropic point. This point does not satisfy (I1) (for $q$ large enough). Indeed, on the one hand, the values of $I(A'':B'')$ and $I(A'':B''|C'')$ converge to zero. On the other hand, the inequality $I(C'':D'') \le I(C'':D''|A'') + I(C'':D''|B'')$ results in

$$1 - O\left(\frac{\log q}{q}\right) \le O\left(\frac{\log q}{q}\right),$$

which cannot hold for large enough $q$.

Construction of an almost entropic counterexample for inequality (I3): In this construction we need another lemma based on Slepian-Wolf coding.

Lemma 4: For every distribution $(a,b,c,d)$ and every integer $N$ there exists a distribution $(A',B',C',D')$ such that the following three conditions hold.

• $H(C'|A',B') = o(N)$.

• Denote by $h$ the entropy profile of $(a,b,c,d)$ and by $h'$ the entropy profile of $(A',B',C',D')$; then the components of $h'$ differ from the corresponding components of $N \cdot h$ by at most $N \cdot H(c|a,b) + o(N)$.

• Moreover, if in the original distribution $I(a:b|c) = 0$, then $I(A':B'|C') = o(N)$.

(This lemma is proven in the Appendix.) Now we construct an almost entropic counterexample to (I3).

1) Start with the distribution $(a,b,c,d)_q$ from Section III-B (the value of $q$ is chosen later).

2) Serialize $(a,b,c,d)_q$, i.e., construct $(A,B,C,D)$ by sampling independently $N$ copies of the distribution $(a,b,c,d)_q$.

3) Apply Lemma 4 and get $(A',B',C',D')$ such that $H(C'|A',B') = o(N)$. Lemma 4 guarantees that the other entropies of $(A',B',C',D')$ are about $N$ times larger than the corresponding entropies for $(a,b,c,d)$, possibly with an overhead of size

$$O(N \cdot H(c|a,b)) = O\left(\frac{\log q}{q} N\right).$$

From the last bullet of Lemma 4 we also have $I(A':B'|C') = o(N)$.

4) Scale down the entropy point of $(A',B',C',D')$ by a factor of $1/N$ within a precision of $1/N$, similarly to step 4 in the construction above.

5) Let $N$ tend to infinity to get an almost entropic point. The conditions of (I3) are satisfied, since $I(A':B'|C')$ and $H(C'|A',B')$ both vanish in the limit. Inequality (I3) reduces to

$$1 - O\left(\frac{\log q}{q}\right) \le O\left(\frac{\log q}{q}\right),$$

which cannot hold if $q$ is large enough. ■

This result can be rephrased as follows: there exist almost entropic points that satisfy all unconditional linear inequalities for entropies but do not satisfy the conditional inequalities (I1) and (I3), respectively.

Note that a single (large enough) value of $q$ suffices to construct almost entropic counterexamples for (I1) and (I3). However, the choice of $q$ in the construction of Theorem 5 provides some freedom: we can control the gap between the left-hand side and the right-hand side of the inequalities. By increasing $q$ we can make the left-hand side of (I1) and (I3) exceed the right-hand side by any given factor.

B. The cone of almost entropic points is not polyhedral 

Theorem 6 (F. Matus, [20]): For $n \ge 4$ the cone of almost entropic points for distributions of $n$-tuples of random variables is not polyhedral. Equivalently, the cone of (unconditional) linear inequalities for the entropies of 4-tuples of random variables is not polyhedral.

Proof: We prove the theorem for $n = 4$. For the sake of contradiction we assume that the cone of almost entropic points in $\mathbb{R}^{15}$ is polyhedral. That is, the set of almost entropic points is the set of solutions of some finite system of linear inequalities

$$f_1 \ge 0, \dots, f_s \ge 0,$$

where each $f_j$ is a linear functional on $\mathbb{R}^{15}$.

The constraints $I(a:d|c) = I(a:c|d) = 0$ specify a face (of co-dimension 2) on the boundary of the cone. The corresponding conditional inequality (I4') specifies a non-degenerate linear functional which is non-negative on the corresponding face. Technically, this functional is defined by the linear form

$$g = I(c:d|a) + I(c:d|b) + I(a:b) - I(c:d).$$

We will show that this linear functional can be extended to the entire space $\mathbb{R}^{15}$ as

$$g' = g + \kappa_1 I(a:d|c) + \kappa_2 I(a:c|d)$$

so that the resulting linear form $g'$ is non-negative on the polyhedral cone. This will contradict Theorem 3.
We change the coordinate system. Instead of the standard coordinates $(x_1, \dots, x_{15})$ corresponding to the entropy values

$$(H(a), H(b), \dots, H(a,b,c,d))$$

we introduce another coordinate system $(y_1, \dots, y_{15})$ such that $y_1 = I(a:d|c)$ and $y_2 = I(a:c|d)$. The choice of $y_3, \dots, y_{15}$ is not important; we only require that the transformation

$$G : (x_1, \dots, x_{15}) \mapsto (y_1, \dots, y_{15})$$

is linear and non-degenerate.

Inequality (I4') can be reformulated as follows: if $y_1 = y_2 = 0$ then $g \ge 0$ (i.e., restricted to this subspace, the linear functional $g$ can be represented as a linear form $g = \alpha_3 y_3 + \dots + \alpha_{15} y_{15}$). On the other hand, in the new coordinate system we can represent each functional $f_j$ as

$$f_j = a_{j,1} y_1 + a_{j,2} y_2 + \dots + a_{j,15} y_{15}.$$

The restriction of each $f_j$ to the subspace $y_1 = y_2 = 0$ can be specified by 13 real coefficients (instead of 15). Denote

$$f'_j = a_{j,3} y_3 + a_{j,4} y_4 + \dots + a_{j,15} y_{15}.$$

We know that for all points $y = (y_3, \dots, y_{15})$ such that $f'_j(y) \ge 0$ for $j = 1, \dots, s$, the inequality $g(y) \ge 0$ holds. It follows from Farkas' lemma that for some reals $c_j \ge 0$

$$g(y) = c_1 f'_1(y) + \dots + c_s f'_s(y).$$



From the definition of $f'_j$ we get

$$g(y) = c_1(f_1 - a_{1,1} y_1 - a_{1,2} y_2) + \dots + c_s(f_s - a_{s,1} y_1 - a_{s,2} y_2).$$

This is an identity for linear forms, so their coordinate representations must be equal to each other. Hence, the forms with these coordinate representations are equal to each other on the entire $\mathbb{R}^{15}$. Coming back to the original system of coordinates, we obtain

$$I(c:d|a) + I(c:d|b) + I(a:b) - I(c:d) = \sum_j c_j f_j - \kappa_1 I(a:d|c) - \kappa_2 I(a:c|d)$$

for some constants $\kappa_1$ and $\kappa_2$. The sum $\sum_j c_j f_j$ is non-negative on the entire cone of almost entropic points since all $f_j$ are by definition non-negative on this cone. Thus, we get the inequality

$$I(c:d) \le I(c:d|a) + I(c:d|b) + I(a:b) + \kappa_1 I(a:d|c) + \kappa_2 I(a:c|d),$$

which must be true for all distributions $(a,b,c,d)$. This contradicts Theorem 3 (Claim 4), and we are done. ■
Remark 8: The argument above works mutatis mutandis for every essentially conditional linear information inequality for the cone of almost entropic points, with constraints that specify some "face" of this cone. In particular, it works for (I4-I6). The original proof in [20] corresponds to this argument with inequality (I5').

V. Conditional Inequalities for Kolmogorov Complexity

The Kolmogorov complexity of a finite binary string $x$ is defined as the length of a shortest program that prints $x$ on the empty input; similarly, the conditional Kolmogorov complexity of a string $x$ given another string $y$ is defined as the length of a shortest program that prints $x$ given $y$ as an input. More formally, for any programming language $\mathcal{L}$, the Kolmogorov complexity $C_{\mathcal{L}}(x|y)$ is defined as

$$C_{\mathcal{L}}(x|y) = \min\{|p| : \text{program } p \text{ prints } x \text{ on input } y\},$$

and the unconditional complexity $C_{\mathcal{L}}(x)$ is defined as the complexity of $x$ given the empty $y$. The basic fact of Kolmogorov complexity theory is the invariance theorem: there exists a universal programming language $U$ such that for any other language $\mathcal{L}$ we have $C_U(x|y) \le C_{\mathcal{L}}(x|y) + O(1)$ (the $O(1)$ depends on $\mathcal{L}$ but not on $x$ and $y$). We fix such a universal language $U$; in what follows we omit the subscript $U$ and denote the Kolmogorov complexity by $C(x)$, $C(x|y)$. We refer the reader to [14] for an exhaustive survey on Kolmogorov complexity.

We introduce notation for the complexity profile of a tuple of strings similar to the notation for the entropy profile (defined in Section II-A). Let $X = (x_1, \dots, x_n)$ be $n$ binary strings; then a Kolmogorov complexity is associated with each subtuple $(x_{i_1}, \dots, x_{i_k})$, and we define the complexity profile of $X$ as

$$\vec{C}(X) = \big(C(x_{i_1}, \dots, x_{i_k})\big)_{1 \le i_1 < \dots < i_k \le n},$$

i.e., a vector of complexities of $2^n - 1$ non-empty tuples in the lexicographic order. We also need a similar notation for the vector of conditional complexities: if $X = (x_1, \dots, x_n)$ is an $n$-tuple of binary strings and $y$ is another binary string, then we denote

$$\vec{C}(X|y) = \big(C(x_{i_1}, \dots, x_{i_k}|y)\big)_{1 \le i_1 < \dots < i_k \le n}.$$

Kolmogorov complexity was introduced in [13] as an algorithmic version of a measure of information (in an individual constructive object). It is natural to expect that the basic properties of Kolmogorov complexity are similar in some sense to the properties of Shannon entropy. Indeed, there are many examples of parallelism between Shannon entropy and Kolmogorov complexity. For example, for the property of Shannon entropy $H(x,y) = H(y) + H(x|y)$ (where $x$ and $y$ are random variables) there is a counterpart in the Kolmogorov setting, $C(x,y) = C(y) + C(x|y) + O(\log(|x| + |y|))$ (where $x$ and $y$ are binary strings). This statement for Kolmogorov
complexity is called the Kolmogorov-Levin theorem, see OTI . 
This result justifies the definition of the mutual information, which is an algorithmic version of Shannon's standard definition: the mutual information between binary strings is defined as

$$I(x:y) := C(x) + C(y) - C(x,y),$$

and the conditional mutual information is defined as

$$I(x:y|z) := C(x,z) + C(y,z) - C(x,y,z) - C(z).$$

From the Kolmogorov-Levin theorem it follows that $I(x:y)$ is equal to $C(x) - C(x|y)$, and the conditional mutual information $I(x:y|z)$ is equal to $C(x|z) - C(x|y,z)$ (all these equalities hold only up to logarithmic terms).
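Kolmogorov complexity is not computable, so the definitions above cannot be evaluated directly. As a rough illustration only, the sketch below substitutes the length of a zlib-compressed string for $C$; this is merely a computable upper-bound heuristic and an assumption of this example, not a technique used in the paper. A pair of strings is encoded by plain concatenation, which is also a simplification.

```python
import random
import zlib

def c(x: bytes) -> int:
    """Compressed length in bytes: a crude computable stand-in for C(x), not C(x) itself."""
    return len(zlib.compress(x, 9))

def mutual_info(x: bytes, y: bytes) -> int:
    """Proxy for I(x:y) = C(x) + C(y) - C(x,y), with the pair encoded as x followed by y."""
    return c(x) + c(y) - c(x + y)

rng = random.Random(0)
x = bytes(rng.randrange(256) for _ in range(2000))
y = bytes(rng.randrange(256) for _ in range(2000))

print(mutual_info(x, x))  # large: the second copy of x is cheap to describe given the first
print(mutual_info(x, y))  # close to zero: independently generated strings share little
```

The proxy reproduces the qualitative behavior only; all the statements in this section concern the true complexity $C$ and hold up to logarithmic terms.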

Actually a much more general similarity between Shannon's 
and Kolmogorov's information theories is known. For every 
linear inequality for Shannon entropy there exists a counterpart 
for Kolmogorov complexity. 

Theorem 7 ( f^j): For each family of coefficients $\{\lambda_W\}$ the inequality

$$\sum_i \lambda_i H(a_i) + \sum_{i<j} \lambda_{i,j} H(a_i, a_j) + \dots \ge 0$$

is true for every distribution $(a_i)$ if and only if for some constant $\kappa$ the inequality

$$\sum_i \lambda_i C(x_i) + \sum_{i<j} \lambda_{i,j} C(x_i, x_j) + \dots + \kappa \log N \ge 0$$

is true for all tuples of strings $(x_i)$ ($N$ denotes the sum of the lengths of all strings $x_i$; the constant $\kappa$ does not depend on the strings $x_i$).

Thus, the class of unconditional inequalities valid for Shannon 
entropy coincides with the class of (unconditional) inequalities 
valid for Kolmogorov complexity. 

Can we extend this parallelism to the conditional informa- 
tion inequalities? It is not obvious how to even formulate con- 
ditional inequalities for Kolmogorov complexity. The subtle 
point is that in the framework of Kolmogorov complexity we 
cannot say that some information quantity exactly equals zero. 
Indeed, even the definition of Kolmogorov complexity makes 
sense only up to an additive term that depends on the choice of 
the universal programming language. Moreover, such a natural 



basic statement as the Kolmogorov-Levin theorem holds only 
up to a logarithmic term. It makes no sense to say that $I(a:b)$ or $I(a:b|c)$ exactly equals zero; we can only require that these information quantities are negligible compared to the lengths of the strings $a, b, c$. So, if we want to prove a meaningful conditional inequality for Kolmogorov complexity, the linear constraints for information quantities must be formulated with some reasonable precision.

In this section we prove two different results. One of these results (Theorem 8) is positive. It shows that some reasonable counterparts of inequalities (I4-I6) hold for Kolmogorov complexity. The other one (Theorem 9) is negative. It claims that natural counterparts of inequalities (I1) and (I3) do not hold for Kolmogorov complexity.

A. Positive result: three conditional inequalities make sense for Kolmogorov complexity

Now we show that the inequalities from Theorem 2 can be translated in some sense into the language of Kolmogorov complexity. In this section we use the notation $\square_{ab,cd}$ (a combination of mutual informations, see Section II-A) for Kolmogorov complexities.

Theorem 8: Let $f(n)$ be any function of an integer argument. Then there exists a $\kappa > 0$ such that for every tuple of binary strings $(a, b, c, d, e)$

(I4-Kolm) if $I(a:d|c) \le f(n)$ and $I(a:c|d) \le f(n)$, then

$$\square_{ab,cd} + I(a:c|e) + I(a:e|c) + \kappa\sqrt{n \cdot f(n)} \ge 0,$$

(I5-Kolm) if $I(b:c|d) \le f(n)$ and $I(c:d|b) \le f(n)$, then

$$\square_{ab,cd} + I(b:c|e) + I(c:e|b) + \kappa\sqrt{n \cdot f(n)} \ge 0,$$

(I6-Kolm) if $I(b:c|d) \le f(n)$ and $I(c:d|b) \le f(n)$, then

$$\square_{ab,cd} + I(c:d|e) + I(c:e|d) + \kappa\sqrt{n \cdot f(n)} \ge 0,$$

where $n$ is the sum of the lengths of the strings $a, b, c, d, e$.
In this theorem $f(n)$ plays the role of the measure of "precision" of the constraints. Technically the statement of the theorem is true for any $f(n)$, but it is interesting only for $f(n) = o(n)$ (and $f(n) = \Omega(\log n)$, since different definitions of the mutual information in algorithmic information theory are equivalent to each other only up to logarithmic precision). For example, assuming $I(a:d|c) = O(\sqrt{n})$ and $I(a:c|d) = O(\sqrt{n})$ we get

$$\square_{ab,cd} + I(a:c|e) + I(a:e|c) + O(n^{3/4}) \ge 0.$$

Proof: By Theorem 7, for every linear inequality for Shannon entropy there exists a counterpart for Kolmogorov complexity that is true for all binary strings up to an additive $O(\log n)$-term. Thus, from inequality (i) on page 4 (which holds for the Shannon entropies of any distribution) it follows that a similar inequality holds for Kolmogorov complexity. More precisely, for each integer $k > 0$ there exists a constant $D$ such that for all strings $a, b, c, d, e$

$$\square_{ab,cd} + I(a:c|e) + I(a:e|c) + \frac{1}{k} I(c:e|a) + \frac{k-1}{2}\big(I(a:d|c) + I(a:c|d)\big) + D \log n \ge 0.$$

We choose $k$ that minimizes the sum of $\frac{1}{k} I(c:e|a)$ and $\frac{k-1}{2}\big(I(a:d|c) + I(a:c|d)\big)$. The value $\frac{1}{k} I(c:e|a)$ is bounded by $O(n/k)$ since all strings are of length at most $n$; the values $I(a:d|c)$ and $I(a:c|d)$ are less than $f(n)$. Let $k = \sqrt{n/f(n)}$. Then we get



Dyjn log n (for some D > 0). Hence, we get 



$$\square_{ab,cd} + I(a:c|e) + I(a:e|c) + O(\sqrt{n \cdot f(n)}) \ge 0,$$

and (I4-Kolm) is proven. Conditional inequalities (I5-Kolm) and (I6-Kolm) can be proven by a similar argument. ■
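The choice $k = \sqrt{n/f(n)}$ balances the two $k$-dependent terms of the inequality. The sketch below illustrates this trade-off numerically, under the simplifying assumptions that $I(c:e|a) \le n$ and that both constraint terms are bounded by $f(n)$, with all hidden constants set to 1; the function names and the sample values of $n$ and $f$ are illustrative.

```python
from math import sqrt

def penalty(n, f, k):
    """Upper bound n/k + (k - 1) * f on the sum of the two k-dependent terms."""
    return n / k + (k - 1) * f

def best_k(n, f):
    """Integer k in 1..n that minimizes the penalty, found by exhaustive search."""
    return min(range(1, n + 1), key=lambda k: penalty(n, f, k))

n, f = 10_000, 16
k_opt = best_k(n, f)
print(k_opt, sqrt(n / f))                      # the minimizer sits at sqrt(n / f)
print(penalty(n, f, k_opt), 2 * sqrt(n * f))   # the minimum is of order sqrt(n * f)
```

With this choice both terms are of order $\sqrt{n \cdot f(n)}$, which is the error term that appears in (I4-Kolm).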
Theorem 8 involves 5-tuples of strings $(a, b, c, d, e)$, but it implies a nontrivial result for quadruples of strings. By assuming $e = d$ we get from Theorem 8 the following corollary.

Corollary 2: Let $f(n)$ be a function of an integer argument such that $f(n) \le n$. Then for every tuple of binary strings $(a, b, c, d)$

(I4'-Kolm) if $I(a:d|c) \le f(n)$ and $I(a:c|d) \le f(n)$, then $\square_{ab,cd} + O(\sqrt{n \cdot f(n)}) \ge 0$,

(I5'-Kolm) if $I(b:c|d) \le f(n)$ and $I(c:d|b) \le f(n)$, then $\square_{ab,cd} + O(\sqrt{n \cdot f(n)}) \ge 0$,

where $n$ is the sum of the lengths of all strings involved.

In Theorem 8 and Corollary 2 we deal with two different measures of precision: $f(n)$ in the conditions and $O(\sqrt{n \cdot f(n)})$ in the conclusions. These two measures of precision are dramatically different. Assume, for example, that $I(b:c|d)$ and $I(c:d|b)$ are bounded by $O(\log n)$, which is the most natural conventional assumption of "independence" in algorithmic information theory. Then from Corollary 2 it follows that $\square_{ab,cd} + O(\sqrt{n \log n}) \ge 0$. Can we prove the same inequality with a precision better than $O(\sqrt{n \log n})$? The answer is negative: the next proposition shows that this bound is tight.

Proposition 1: For some $\kappa > 0$, for infinitely many integers $n$ there exists a tuple of strings $(a, b, c, d)$ such that $C(a,b,c,d) = n$, $I(b:c|d) = O(\log n)$ and $I(c:d|b) = O(\log n)$, and

$$\square_{ab,cd} \le -\kappa\sqrt{n \log n}.$$

Proof: Let us take the distribution from Claim 5 (p. 6) for some parameter $\varepsilon$, and denote it $(\alpha, \beta, \gamma, \delta)_\varepsilon$. Further, we apply the following simple lemma from [25].

Lemma 5 ([25]): Let $(\alpha, \beta, \gamma, \delta)$ be a distribution on some finite set $\mathcal{M}^4$, and $n$ be an integer. Then there exists a tuple of strings $(a, b, c, d)$ of length $n$ over the alphabet $\mathcal{M}$ such that

$$\vec{C}(a,b,c,d) = n \cdot \vec{H}(\alpha, \beta, \gamma, \delta) + O(|\mathcal{M}| \log n).$$

From this lemma we get a tuple of strings $(a, b, c, d)$ such that the quantities $I(c:d|a)$, $I(c:d|b)$, $I(a:b)$ are bounded by $O(\log n)$, while

$$I(c:d) = \Theta(\varepsilon n)$$

and

$$I(b:c|d) = O(\varepsilon^2 n).$$

It remains to choose appropriate $\varepsilon$ and $n$. Let $\varepsilon = \sqrt{\frac{\log n}{n}}$. Then $I(b:c|d) = O(\varepsilon^2 n) = O(\log n)$ and $I(c:d) \ge D\sqrt{n \log n}$ for some constant $D > 0$, so

$$\square_{ab,cd} \le -D\sqrt{n \log n} + O(\log n),$$

and we are done. Keeping in mind the details of the construction from Claim 5, we can let $D = 1$ here (though the precise value of $D$ does not much matter for this proof). ■
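The scaling behind this choice of $\varepsilon$ can be illustrated with Shannon quantities. The sketch below assumes the four-point distribution from Claim 5 with probabilities $\tfrac12 - \varepsilon$, $\tfrac12 - \varepsilon$, $\varepsilon$, $\varepsilon$ and multiplies its information quantities by $n$, which by Lemma 5 approximates the corresponding quantities for strings up to logarithmic terms; the helper names and the value of $n$ are illustrative.

```python
from math import log2, sqrt

B, C, D = 1, 2, 3  # positions of b, c, d in an outcome tuple (a, b, c, d)

def entropy(dist, idx):
    """Shannon entropy (in bits) of the marginal of dist on the coordinates idx."""
    marg = {}
    for point, p in dist.items():
        key = tuple(point[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def cmi(dist, x, y, z=()):
    """Conditional mutual information I(x:y|z) = H(x,z) + H(y,z) - H(x,y,z) - H(z)."""
    x, y, z = tuple(x), tuple(y), tuple(z)
    return (entropy(dist, x + z) + entropy(dist, y + z)
            - entropy(dist, x + y + z) - entropy(dist, z))

def scaled_quantities(n):
    """Return (n * I(c:d), n * I(b:c|d)) for eps = sqrt(log n / n)."""
    eps = sqrt(log2(n) / n)
    dist = {(0, 0, 0, 0): 0.5 - eps, (0, 1, 0, 1): 0.5 - eps,
            (1, 0, 1, 0): eps, (1, 1, 0, 0): eps}
    return n * cmi(dist, [C], [D]), n * cmi(dist, [B], [C], [D])

n = 10**6
i_cd, i_bc_d = scaled_quantities(n)
print(i_cd / sqrt(n * log2(n)))  # close to 1: I(c:d) is of order sqrt(n log n)
print(i_bc_d / log2(n))          # bounded: I(b:c|d) is of order log n
```

The first ratio staying near 1 is consistent with the remark that $D = 1$ can be used in the proof.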



B. Negative result: two conditional inequalities are not valid for Kolmogorov complexity

The following theorem shows that counterparts of (I1) and (I3) do not hold for Kolmogorov complexity.

Theorem 9: (a) There exists an infinite sequence of tuples of strings $(a,b,c,d)_n$ such that the lengths of all strings $a, b, c, d$ are $\Theta(n)$, $I(a:b) = O(\log n)$, $I(a:b|c) = O(\log n)$, and

$$I(c:d) - I(c:d|a) - I(c:d|b) = \Omega(n).$$

(b) There exists an infinite sequence of tuples of strings $(a,b,c,d)_n$ such that the lengths of all strings $a, b, c, d$ are $\Theta(n)$, $C(c|a,b) = O(\log n)$, $I(a:b|c) = O(\log n)$, and

$$I(c:d) - I(c:d|a) - I(c:d|b) - I(a:b) = \Omega(n).$$

The proof of this theorem is similar to the proof of Theorem 5. Instead of the Slepian-Wolf theorem we use Muchnik's theorem on conditional descriptions:

Theorem 10 (Muchnik, [22]): For all strings $x, y$ there exists a string $z$ such that

• $|z| = C(x|y)$,

• $C(z|x) = O(\log n)$,

• $C(x|y,z) = O(\log n)$,

where $n = C(x,y)$. We denote this string $z$ by $\mathrm{Much}(x|y)$.

Proof of Theorem 9(a): We start with the distribution defined in Section III-B; the value of $q$ is specified in what follows. Let us denote this distribution $(\alpha, \beta, \gamma, \delta)$. Then we apply Lemma 5 and construct strings $a, b, c, d$ such that

$$\vec{C}(a,b,c,d) = n \cdot \vec{H}(\alpha, \beta, \gamma, \delta) + O(\log n).$$

Note that for the constructed $a, b, c, d$

$$I(c:d) \gg I(c:d|a) + I(c:d|b) + I(a:b),$$

and $I(a:b|c) = O(\log n)$. But this quadruple of strings does not satisfy the requirements of the theorem since $I(a:b)$ is much greater than $\log n$. Thus, we need to transform $a, b, c, d$ so that (i) we keep the property $\square_{ab,cd} \ll 0$, (ii) $I(a:b|c)$ remains logarithmic, and (iii) $I(a:b)$ becomes logarithmic. To this end, we only need to modify the string $a$.

We apply Theorem 10 to $a$ and $b$ and get $a' = \mathrm{Much}(a|b)$ such that

• $C(a'|a) = O(\log n)$,

• $C(a') = C(a|b) + O(\log n)$,

• $C(a|a',b) = O(\log n)$.

It follows immediately that

• $C(a|a') = I(a:b) + O(\log n)$,

• $I(a':b) = O(\log n)$,

• $I(a':b|c) = O(\log n)$ (since for the original tuple we have $I(a:b|c) = O(\log n)$).
Intuitively, $a'$ is the "difference" $a$ minus $b$. In what follows we investigate the quadruple $(a',b,c,d)$.

The complexity profile of $(a',b,c,d)$ cannot differ much from the complexity profile of $(a,b,c,d)$. Indeed, $C(a'|a) = O(\log n)$ and $C(a|a') = I(a:b) + O(\log n)$. Hence, the difference between the corresponding components of the complexity profiles $\vec{C}(a',b,c,d)$ and $\vec{C}(a,b,c,d)$ is at most $I(a:b) + O(\log n) = O\left(n \cdot \frac{\log q}{q}\right)$.

For the constructed tuple $(a',b,c,d)$ we have $I(a':b) = O(\log n)$ and $I(a':b|c) = O(\log n)$. On the other hand,

$$I(c:d) = n \cdot \left(1 - \frac{1}{q}\right) + O(\log n),$$

$$I(c:d|b) = n \cdot O\left(\frac{\log q}{q}\right) + O(\log n),$$

and

$$I(c:d|a') \le I(c:d|a) + C(a|a') = O\left(n \frac{\log q}{q}\right) + O(\log n).$$

Thus, for large enough $q$ we get

$$I(c:d) - I(c:d|a') - I(c:d|b) = \Omega(n).$$

■

Proof of Theorem 9(b): We again use the distribution from Section III-B (the value of $q$ is chosen later). Let us denote this distribution $(\alpha, \beta, \gamma, \delta)$. Then we apply Lemma 5 to this distribution and obtain a quadruple of strings $(a, b, c, d)$ such that

$$\vec{C}(a,b,c,d) = n \cdot \vec{H}(\alpha, \beta, \gamma, \delta) + O(\log n).$$

For the constructed $a, b, c, d$

$$\square_{ab,cd} \le -cn$$

(for some real $c > 0$) and $I(a:b|c) = O(\log n)$. However, this quadruple of strings does not satisfy the requirements of the theorem since $C(c|a,b)$ is much greater than $\log n$. It remains to modify $a, b, c, d$ so that (i) we keep the property $\square_{ab,cd} \ll 0$, (ii) $I(a:b|c)$ remains logarithmic, and (iii) $C(c|a,b)$ becomes logarithmic.

We apply Theorem 10 to the constructed strings $(a,b,c,d)$ and get $x = \mathrm{Much}(c|a,b)$ such that

• $C(x|c) = O(\log n)$,

• $C(x) = C(c|a,b) + O(\log n)$,

• $C(c|a,b,x) = O(\log n)$.

Let us consider the conditional complexity profile $\vec{C}(a,b,c,d|x)$. It is not hard to check that the components of $\vec{C}(a,b,c,d|x)$ differ from the corresponding components of $\vec{C}(a,b,c,d)$ by at most $C(x) + O(\log n)$. Moreover, we have $I(a:b|c,x) = O(\log n)$ and $C(c|a,b,x) = O(\log n)$. Thus, we would like to "relativize" $a, b, c, d$ given $x$ as an oracle. To this end, we apply the following lemma.

Lemma 6: For a string $x$ and an $m$-tuple of strings $y = (y_1, \dots, y_m)$ there exists an $m$-tuple $Z = (z_1, \dots, z_m)$ such that

$$\vec{C}(Z) = \vec{C}(y|x) + O(\log N),$$

where $N = C(y_1, \dots, y_m)$.

(See the proof in the Appendix.) From Lemma 6 it follows that there exists another tuple of strings $(a',b',c',d')$ such that

$$\vec{C}(a',b',c',d') = \vec{C}(a,b,c,d|x) + O(\log n),$$

where $n = C(a,b,c,d,x)$.

For this tuple, $C(c'|a',b') = O(\log n)$ and $I(a':b'|c') = O(\log n)$. On the other hand,

$$I(c':d') = n - O\left(n \frac{\log q}{q} + \log n\right),$$

$$I(a':b') = O\left(n \frac{\log q}{q} + \log n\right),$$

$$I(c':d'|b') = O\left(n \frac{\log q}{q} + \log n\right),$$

$$I(c':d'|a') = O\left(n \frac{\log q}{q} + \log n\right).$$

Hence, for large enough $q$ we get

$$I(c':d') - I(c':d'|a') - I(c':d'|b') - I(a':b') = \Omega(n).$$



VI. Conclusion 

In this paper we discussed several conditional linear inequal- 
ities for Shannon entropy and proved that these inequalities 
are essentially conditional, i.e., they cannot be obtained by 
restricting an unconditional linear inequality to the corresponding subspace. We proved that there are two types of essentially conditional inequalities: some of them hold for almost entropic points, while the others hold only for entropic points. We discussed the geometric meaning of inequalities of the first type: every essentially conditional inequality for almost entropic points corresponds to an infinite family of unconditional linear inequalities, and implies the fact that the cone of almost entropic points is not polyhedral. We also proved that some (but not all) essentially conditional inequalities can be translated into the language of Kolmogorov complexity.

Many questions remain unsolved:

• Does inequality (I2) hold for almost entropic points?

• All known essentially conditional inequalities have at least two linear constraints. Do there exist any essentially conditional inequalities with a constraint of co-dimension 1?

• For the proven inequalities (I4'-Kolm) and (I5'-Kolm), if the constraints hold up to $O(\log n)$, then the inequality holds up to $O(\sqrt{n \log n})$. Does there exist any essentially conditional inequality for Kolmogorov complexity with $O(\log n)$-terms both in the constraints and in the resulting inequality?

• The essentially conditional linear inequalities that do not hold for almost entropic points remain unexplained. Do they have any geometric or physical meaning?

• Matus proved that for $n \ge 4$ random variables the cone of all almost entropic points is not polyhedral, i.e., cannot be represented as an intersection of a finite number of half-spaces. Can we represent this cone as an intersection of countably many tangent half-spaces?

• It would be interesting to establish the connection between (i) conditional information inequalities that hold for almost entropic points, (ii) infinite families of unconditional linear inequalities, and (iii) nonlinear information inequalities. (In [4] a family of linear inequalities from [20] was converted into a quadratic inequality for entropies. This result seems to be an instance of a more general phenomenon.)
VII. Appendix

Proof of lemmaU} We say that two pairs of values $(c_1, d_1)$ and $(c_2, d_2)$ are equivalent if the conditional distributions on $a$ given the conditions $c = c_1\ \&\ d = d_1$ and $c = c_2\ \&\ d = d_2$ are the same, i.e., for every $a_0$

$$\mathrm{Prob}[a = a_0 \mid c = c_1, d = d_1] = \mathrm{Prob}[a = a_0 \mid c = c_2, d = d_2].$$

The equivalence class corresponding to a random $(c, d)$ is also a random variable. We denote it $w$.

From the definition of $w$ it follows that the conditional distributions on $a$ given $w$ and given $(c, d, w)$ are exactly the same. Hence, $I(a:c,d|w) = 0$ ($c, d$ cannot give more information on $a$ than $w$).

Further, from $I(a:d|c) = 0$ it follows that $w$ functionally depends on $c$, and from $I(a:c|d) = 0$ it follows that $w$ functionally depends on $d$ (the conditional distributions on $a$ given $c = c_1$, given $d = d_1$, and given ($c = c_1\ \&\ d = d_1$) are all the same). Thus, $H(w|c) = H(w|d) = 0$. ■
Proof of Lemma 4: First we serialize $(a, b, c, d)$, i.e., we take $M$ i.i.d. copies of the initial distribution. The result is a distribution $(A, B, C, D)$ whose entropy profile is exactly the entropy profile of $(a, b, c, d)$ multiplied by $M$. In particular, $I(A:B|C) = 0$ whenever $I(a:b|c) = 0$. Then, we apply Slepian-Wolf coding (Lemma 3) and get $Z = SW(C|A,B)$ such that

• $H(Z|C) = 0$,

• $H(Z) = H(C|A,B) + o(M)$,

• $H(C|A,B,Z) = o(M)$.

The entropy profile of the conditional distribution of $(A, B, C, D)$ given $Z$ differs from the entropy profile of $(A, B, C, D)$ by at most $H(Z) = M \cdot H(c|a,b) + o(M)$ (i.e., the difference between $H(A)$ and $H(A|Z)$, $H(B)$ and $H(B|Z)$, etc. is not greater than $H(Z)$). Also, if in the original distribution $I(a:b|c) = 0$, then $I(A:B|C,Z) = I(A:B|C) = 0$.

We would like to "relativize" $(A, B, C, D)$ conditional on $Z$ and get a new distribution for a quadruple $(A', B', C', D')$ whose unconditional entropies are equal to the corresponding entropies of $(A, B, C, D)$ given $Z$. This "relativization" procedure is not straightforward since for different values of $Z$ the corresponding conditional distributions on $(A, B, C, D)$ can be very different. The simplest way to overcome this obstacle is the method of quasi-uniform distributions proposed by Chan and Yeung in [3].

Definition 2 (Chan-Yeung, [3]): A random variable $u$
distributed on a finite set $U$ is called quasi-uniform if the
probability distribution function of $u$ is constant over its support (all
values of $u$ have the same probability). That is, there exists
$c > 0$ such that $\mathrm{Prob}[u = u_0] \in \{0, c\}$ for all $u_0 \in U$. A set
of random variables $(x_1, \ldots, x_n)$ is called quasi-uniform if
for any non-empty subset $\{i_1, \ldots, i_k\} \subset \{1, \ldots, n\}$ the joint
distribution $(x_{i_1}, \ldots, x_{i_k})$ is quasi-uniform.
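Definition 2 can be tested mechanically for a finite joint distribution. The following is a minimal sketch under the assumption that the distribution is given as a dictionary from outcome tuples to probabilities; the function names and the tolerance `tol` are hypothetical and not taken from [3].

```python
from itertools import combinations

def marginal(dist, coords):
    """Marginal distribution on the listed coordinates of each outcome."""
    out = {}
    for outcome, p in dist.items():
        key = tuple(outcome[i] for i in coords)
        out[key] = out.get(key, 0.0) + p
    return out

def is_constant_on_support(dist, tol=1e-9):
    """True if all outcomes of positive probability have the same probability."""
    probs = [p for p in dist.values() if p > tol]
    return max(probs) - min(probs) <= tol

def is_quasi_uniform(dist, tol=1e-9):
    """True if every non-empty sub-tuple of the variables is quasi-uniform."""
    n = len(next(iter(dist)))
    for k in range(1, n + 1):
        for coords in combinations(range(n), k):
            if not is_constant_on_support(marginal(dist, coords), tol):
                return False
    return True

# Uniform on a support whose two projections are also uniform.
good = {o: 0.25 for o in [(0, 0), (1, 0), (2, 1), (3, 1)]}
# Uniform on its support, but the first coordinate alone is not uniform.
bad = {o: 1 / 3 for o in [(0, 0), (0, 1), (1, 1)]}

print(is_quasi_uniform(good), is_quasi_uniform(bad))
```

The second example shows why the definition quantifies over all subsets of variables: a joint distribution that is uniform on its support can still have a non-uniform marginal.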

In [3, Theorem 3.1] it is proven that for every distribution
$(A, B, C, D, Z)$ and every $\delta > 0$ there exists a quasi-uniform
distribution $(A'', B'', C'', D'', Z'')$ and an integer $k$ such that

$$\Bigl\| \vec{H}(A, B, C, D, Z) - \frac{1}{k}\, \vec{H}(A'', B'', C'', D'', Z'') \Bigr\| < \delta.$$

For a quasi-uniform distribution, for all values $z$ of $Z''$ the
corresponding conditional distributions of $(A'', B'', C'', D'')$
have the same entropies, which are equal to the corresponding
conditional entropies. That is, the entropies of the distributions of
$A''$, $B''$, $(A'', B'')$, ... given $Z'' = z$ are equal to $H(A'' \mid Z'')$,
$H(B'' \mid Z'')$, $H(A'', B'' \mid Z'')$, ... Thus, for a quasi-uniform
distribution we can perform a "relativization" as follows. We
fix any value $z$ of $Z''$ and take the conditional distribution
on $(A'', B'', C'', D'')$ given $Z'' = z$. In this conditional
distribution the entropy of $C''$ given $(A'', B'')$ is not greater
than

$$k \cdot \bigl(H(C \mid A, B, Z) + \delta\bigr) = k \cdot \bigl(\delta + o(M)\bigr).$$
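The property used in this relativization step, namely that for a quasi-uniform pair the entropy of the conditional distribution given each single value of the second variable coincides with the conditional entropy, can be illustrated on a toy example. This is a minimal sketch: the support, the variable names and the helpers are assumptions chosen for illustration.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a distribution {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(dist, coords):
    """Marginal distribution on the listed coordinates of each outcome."""
    out = {}
    for outcome, p in dist.items():
        key = tuple(outcome[i] for i in coords)
        out[key] = out.get(key, 0.0) + p
    return out

def condition(dist, coord, value):
    """Conditional distribution of the whole tuple given outcome[coord] == value."""
    mass = sum(p for o, p in dist.items() if o[coord] == value)
    return {o: p / mass for o, p in dist.items() if o[coord] == value}

# Hypothetical quasi-uniform pair (A'', Z''): uniform on four points, with
# A'' uniform on {0, 1, 2, 3} and Z'' uniform on {0, 1}.
dist = {o: 0.25 for o in [(0, 0), (1, 0), (2, 1), (3, 1)]}

# H(A'' | Z'') computed from the entropy profile.
h_a_given_z = entropy(dist) - entropy(marginal(dist, [1]))

# Entropy of A'' in the conditional distribution given each fixed value z.
per_value = {z: entropy(marginal(condition(dist, 1, z), [0])) for z in (0, 1)}

print(h_a_given_z, per_value)
```

For a distribution that is not quasi-uniform the values in `per_value` may differ from one another, and only their average weighted by $\mathrm{Prob}[Z'' = z]$ equals the conditional entropy.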

If $\delta$ is small enough, then all entropies of $(A'', B'', C'', D'')$
given $Z'' = z$ differ from the corresponding components of
$kM \cdot \vec{H}(a, b, c, d)$ by at most $H(Z'') \le kM \cdot H(c \mid a, b) + o(kM)$.

Moreover, the mutual information between $A''$ and $B''$
given $(C'', Z'')$ is the same as the mutual information between
$A''$ and $B''$ given only $C''$, since $Z$ functionally depends on
$C$. If in the original distribution $I(a : b \mid c) = 0$, then the mutual
information between $A''$ and $B''$ given $(C'', Z'')$ is $o(kM)$.

We choose $\delta$ small enough (e.g., $\delta = 1/M$) and let
$(A', B', C', D')$ be the conditional distribution constructed
above. ■
Proof of Lemma 6: We are given a tuple of strings $y =
(y_1, \ldots, y_m)$. For every subset of indices $W = \{i_1, \ldots, i_k\}$
we denote $y_W = (y_{i_1}, \ldots, y_{i_k})$, assuming
$1 \le i_1 < \cdots < i_k \le m$.

Let $S$ be the set of all tuples $y' = (y'_1, \ldots, y'_m)$ such that
for all $U, V \subset \{1, \ldots, m\}$

$$C(y'_U \mid x) \le C(y_U \mid x) \quad \text{and} \quad C(y'_U \mid y'_V, x) \le C(y_U \mid y_V, x).$$

In particular, this set contains the initial tuple $y$. Further, for
each $U \subset \{1, \ldots, m\}$ we denote

$$h_U = \log \bigl(\text{the size of the projection of } S \text{ onto the } U\text{-coordinates}\bigr)$$

and

$$h_{\bar U \mid U} = \log \bigl(\text{the maximal size of a section of } S \text{ for some fixed } U\text{-coordinates}\bigr).$$

E.g., if $U = \{1\}$, then $h_{\{1\}}$ is equal to the logarithm of the
number of all strings $y'_1$ such that for some strings $y'_2, \ldots, y'_m$
the tuple $(y'_1, y'_2, \ldots, y'_m)$ belongs to $S$. Similarly, for this $U$
the value $h_{\bar U \mid U}$ is by definition equal to

$$\log \max_{y'_1} \#\{(y'_2, \ldots, y'_m) \ \text{s.t.}\ (y'_1, y'_2, \ldots, y'_m) \in S\}.$$

The cardinality of $S$ is less than $2^{C(y \mid x) + 1}$, since for each
tuple $y'$ in $S$ there exists a program of length at most $C(y \mid x)$
which translates $x$ to $y'$. Similarly, for each $U$

$$h_U \le C(y_U \mid x) + 1 \quad \text{and} \quad h_{\bar U \mid U} \le C(y_{\bar U} \mid y_U, x) + 1,$$

where $\bar U = \{1, \ldots, m\} \setminus U$.

On the other hand, the cardinality of $S$ cannot be less than
$2^{C(y \mid x) - O(\log N)}$, since we can specify $y$ given $x$ by the list
of numbers $h_U$ and $h_{\bar U \mid U}$ and the ordinal number of $y$ in the
standard enumeration of $S$.

We need only $O(\log N)$ bits to specify all numbers $h_U$ and
$h_{\bar U \mid U}$ (the constant in this $O(\log N)$ term depends on $m$ but not
on $N$). Given all $h_U$ and $h_{\bar U \mid U}$ we can find some set $S'$ such
that the sizes of all projections and maximal sections of $S'$
match the corresponding $h_U$ and $h_{\bar U \mid U}$. Such sets must exist
(e.g., the set $S$ itself satisfies all these conditions), so
we can find one such set by brute-force search. Note that we
do not need to know $x$ to run this search.

For each tuple $z = (z_1, \ldots, z_m)$ in $S'$ we have

$$C(z) \le h_{\{1, \ldots, m\}} + O(\log N) = C(y \mid x) + O(\log N),$$

since we can specify this tuple by the list of all $h_U$ and $h_{\bar U \mid U}$
and the index of $z$ in the list of elements of $S'$. Similarly, for
each set of indices $U$

$$C(z_U) \le h_U = C(y_U \mid x) + O(1)$$

and

$$C(z_{\bar U} \mid z_U) \le h_{\bar U \mid U} = C(y_{\bar U} \mid y_U, x) + O(1). \qquad (8)$$

Let $z = (z_1, \ldots, z_m)$ be some tuple in $S'$ with the maximal
possible complexity. Then $C(z) = C(y \mid x) + O(\log N)$
(the complexity of $z$ cannot be much less, since the cardinality
of $S'$ is $2^{C(y \mid x) - O(\log N)}$). For this tuple $z$, inequality (8)
becomes an equality:

$$C(z_U) = h_U = C(y_U \mid x) + O(\log N).$$

Indeed, if $C(z_U)$ is much less than $h_U$, then

$$C(z) = C(z_U) + C(z_{\bar U} \mid z_U) + O(\log N)$$

is much less than

$$C(y \mid x) = C(y_U \mid x) + C(y_{\bar U} \mid y_U, x) + O(\log N),$$

and we get a contradiction with the choice of $z$. Hence, the
difference between the corresponding components of the complexity
profiles $\vec{C}(z)$ and $\vec{C}(y \mid x)$ is bounded by $O(\log N)$. ■
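The combinatorial quantities used in the proof of Lemma 6, the logarithm of the size of a projection and the logarithm of the maximal size of a section, are easy to compute for an explicit finite set of tuples. The following is a minimal sketch on a hypothetical toy set; the set `S`, the function names and the use of base-2 logarithms without rounding are assumptions made for illustration.

```python
import math
from collections import Counter

def projection_log_size(S, U):
    """log2 of the number of distinct U-coordinate sub-tuples of tuples in S."""
    return math.log2(len({tuple(t[i] for i in U) for t in S}))

def max_section_log_size(S, U):
    """log2 of the largest number of tuples in S sharing the same U-coordinates.
    Since S is a set, tuples with equal U-coordinates differ on the remaining
    coordinates, so this count is the size of the corresponding section."""
    counts = Counter(tuple(t[i] for i in U) for t in S)
    return math.log2(max(counts.values()))

# Hypothetical toy set of pairs of binary strings (m = 2).
S = {("00", "0"), ("00", "1"), ("01", "0"), ("10", "0")}
U = [0]  # fix the first coordinate

h_U = projection_log_size(S, U)         # log2 of 3 distinct first coordinates
h_section = max_section_log_size(S, U)  # log2 of 2: "00" has two extensions

# Every tuple of S is determined by its U-projection and its position in the
# corresponding section, hence log2|S| <= h_U + h_section.
print(math.log2(len(S)), h_U + h_section)
```

The final comparison is the counting analogue of the chain rule used in the proof: the number of tuples is at most the number of projections times the maximal section size.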



